Add audio and video as native model input to file_read (currently images only) #1266

## Summary

`file_read` can hand **images** (png/jpeg/gif/webp) to multimodal models, but **audio and video are recognized yet never reach the model** — they're delivered path-only. Add a delivery path so audio/video files can be supplied as native model input to providers that accept them.

## Current state

Audio and video MIME types are already catalogued and categorized (`MediaKind.Audio` / `MediaKind.Video`, `ModelModality.Audio` / `ModelModality.Video` exist), but every model-input gate is image-only:

- `MimeTypeCatalog` flags only the four image formats `SupportsModelInput = true`; audio (`mp3/m4a/wav/ogg`) and video (`mp4/mov/webm/mkv/avi`) are `false`.
- `SessionMediaStore.TryGetSupportedModelInput` **hardcodes** `MediaModality.Image` / `ModelModality.Image`.
- `AttachmentInlineDecision.Resolve` only inlines images.
- `FileReadTool.HandleNonTextFile` returns *"Audio transcription and video keyframe extraction are not built into file_read"* for media.
- `OpenAiCompatibleChatClient.ToMessage` only emits `image_url` content parts and **throws** on any non-image `DataContent`.

So even on a model that advertises `input: Text, Image, Video`, Netclaw never feeds it audio/video.

## Goal

When the active model accepts audio/video, `file_read` on a supported audio/video file should load it as model-visible input on the next call — the same flow images use today.

## Touchpoints (for whoever implements)

1. **`MimeTypeCatalog`** — flip `SupportsModelInput = true` for the specific audio/video formats we intend to deliver (selectively; not every codec is provider-ingestible).
2. **`SessionMediaStore.TryGetSupportedModelInput`** — stop hardcoding `Image`; map MIME → correct `MediaModality` + required `ModelModality` (Audio/Video).
3. **`AttachmentInlineDecision.Resolve`** — add audio/video branches (currently only images inline; everything else is path-only).
4. **`FileReadTool.HandleNonTextFile`** — register audio/video as model input on capable models instead of returning the "not built in" guidance.
5. **Chat client serialization** (`OpenAiCompatibleChatClient` + any other provider clients) — emit the right content parts. This is the hard part and is **provider-specific**:
   - **Audio:** OpenAI-compatible chat completions use an `input_audio` content part (base64 `data` plus a `format` field). Other OpenAI-compatible servers such as vLLM vary in what they accept.
   - **Video:** OpenAI chat completions has **no** native video part; typically needs frame extraction. Gemini accepts native video. So video likely needs a provider-capability branch (or a frame-sampling fallback).
6. **Size/budget** — `MaxModelInputFileBytes` / `ModelInputBatchBudget` are tuned for image-sized payloads. Audio/video files are far larger; revisit the limits, and consider duration/sampling caps.
7. **Security** — magic-byte validation already exists (`IsFileMagicCompatible`); extend coverage to the new formats.
8. **Capability detection** — `ModelModality.Audio/Video` already flow through `context.ModelInputModalities`; confirm the resolvers populate them correctly per provider.
9. **Tests** — extend the actor-level delivery test (added in #1265) to cover audio/video reaching the chat client.
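
To make touchpoints 5 and 7 concrete for the audio phase, here is a minimal, language-neutral sketch in Python (not the project's own code) of building an OpenAI-style `input_audio` content part after a cheap magic-byte check. The names `AUDIO_FORMAT_BY_MIME`, `looks_like`, and `to_audio_part` are hypothetical, and restricting the mapping to wav/mp3 is an assumption about which formats the provider accepts. The real change would live in `OpenAiCompatibleChatClient.ToMessage` and `IsFileMagicCompatible`.

```python
import base64
from typing import Optional

# Hypothetical MIME -> input_audio "format" mapping. Limiting it to
# wav/mp3 is an assumption; other providers may accept more or fewer.
AUDIO_FORMAT_BY_MIME = {
    "audio/wav": "wav",
    "audio/x-wav": "wav",
    "audio/mpeg": "mp3",
}


def looks_like(fmt: str, data: bytes) -> bool:
    """Cheap magic-byte check, in the spirit of IsFileMagicCompatible."""
    if fmt == "wav":
        # RIFF container whose form type is WAVE.
        return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
    if fmt == "mp3":
        # Either an ID3v2 tag or an MPEG audio frame sync (11 set bits).
        has_id3 = data[:3] == b"ID3"
        has_sync = len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
        return has_id3 or has_sync
    return False


def to_audio_part(mime: str, data: bytes) -> Optional[dict]:
    """Return an input_audio content part, or None if the file should stay path-only."""
    fmt = AUDIO_FORMAT_BY_MIME.get(mime)
    if fmt is None or not looks_like(fmt, data):
        return None
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(data).decode("ascii"),
            "format": fmt,
        },
    }
```

Returning `None` rather than throwing keeps the caller free to fall back to path-only delivery, which differs from the current behavior of throwing on any non-image `DataContent`.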

## Notes

- Suggest a **phased** approach: **audio first** (tractable via OpenAI `input_audio`), then **video** (provider-specific, may need frame extraction).
- Capability gating must stay strict: never send audio/video to a model that doesn't accept it. Fall back to path-only delivery plus a model-visible note (consistent with the image path).
- The recent image-delivery fix (#1264 / #1265) is already **modality-agnostic** at the `SessionState`/media-nudge layer, so the persistence/hand-off plumbing won't need rework — the gaps are entirely in the catalog gates, `file_read`, and provider serialization listed above.
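
The strict-gating rule in the notes above could be sketched roughly as follows. This is an illustrative Python sketch only, not the project's implementation: `Modality`, `Delivery`, and `decide_delivery` are hypothetical stand-ins for `ModelModality`, the inline decision made by `AttachmentInlineDecision.Resolve`, and the modalities carried in `context.ModelInputModalities`. Mapping by MIME prefix is a simplification; the real catalog would opt in individual formats.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


# Hypothetical prefix-based mapping (assumption: the real catalog is per-format).
MODALITY_BY_MIME_PREFIX = {
    "image/": Modality.IMAGE,
    "audio/": Modality.AUDIO,
    "video/": Modality.VIDEO,
}


@dataclass(frozen=True)
class Delivery:
    inline: bool  # True: send as native model input; False: path-only
    note: str     # model-visible explanation when falling back


def required_modality(mime: str) -> Optional[Modality]:
    """Map a MIME type to the modality the model must accept, if any."""
    for prefix, modality in MODALITY_BY_MIME_PREFIX.items():
        if mime.startswith(prefix):
            return modality
    return None


def decide_delivery(mime: str, model_modalities: Set[Modality]) -> Delivery:
    """Inline only when the active model advertises the required modality."""
    required = required_modality(mime)
    if required is None:
        return Delivery(False, f"{mime} is not a model-input type; delivered path-only.")
    if required not in model_modalities:
        return Delivery(
            False,
            f"Active model does not accept {required.value} input; delivered path-only.",
        )
    return Delivery(True, "")
```

For example, `decide_delivery("audio/mpeg", {Modality.TEXT, Modality.IMAGE})` falls back to path-only with a note, while adding `Modality.AUDIO` to the set allows inline delivery.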

Related: #1264, #1265.

