Summary
Model input config (`models.providers[].models[].input`) rejects `"video"` and `"audio"` values at zod validation, preventing users from declaring Gemini native multimodal capabilities, even though `MAX_VIDEO_BYTES`, `MAX_AUDIO_BYTES`, and `MediaUnderstandingCapabilitiesSchema` in the same codebase already support them.
Steps to reproduce
- Add a model override in `openclaw.json`:

  ```json
  {
    "models": {
      "providers": {
        "google": {
          "models": [
            {
              "id": "gemini-2.5-flash",
              "input": ["text", "image", "video"]
            }
          ]
        }
      }
    }
  }
  ```
- Run `openclaw doctor` or start the agent.
- Config validation fails with a zod error:

  ```
  Invalid input at models.providers.google.models.0.input.2 (expected "text" or "image", got "video")
  ```
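The rejection can be reproduced in isolation. Below is a minimal sketch using the constraint quoted verbatim in the table under "Logs, screenshots, and evidence"; only the union comes from the source, and the variable name is illustrative:

```ts
import { z } from "zod";

// The exact constraint from src/config/zod-schema.core.ts:41;
// "InputSchema" is an illustrative name, not the one in the source.
const InputSchema = z.array(z.union([z.literal("text"), z.literal("image")]));

const result = InputSchema.safeParse(["text", "image", "video"]);
console.log(result.success); // false: "video" at index 2 matches neither literal
```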
Expected behavior
Config accepts `"video"` and `"audio"` in the `input` array, since the runtime already has infrastructure for these media types (`MAX_VIDEO_BYTES`, `MAX_AUDIO_BYTES`, `mediaKindFromMime()`, `maxBytesForKind()`).
Actual behavior
The zod schema at `src/config/zod-schema.core.ts:41` rejects any value other than `"text"` or `"image"`. Validation in `src/config/validation.ts:103` (`OpenClawSchema.safeParse()`) produces:

```
Invalid input at models.providers.google.models.0.input.2 — expected "text" or "image"
```

Notably, `MediaUnderstandingCapabilitiesSchema` in the same file (line 405) already accepts `["image", "audio", "video"]`.
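For contrast, the two constraints side by side; the model-input union is quoted from line 41, while the shape of `MediaUnderstandingCapabilitiesSchema` is an assumption reconstructed from the values it accepts:

```ts
import { z } from "zod";

// Line 41: model input rejects anything beyond "text" and "image".
const ModelInput = z.array(z.union([z.literal("text"), z.literal("image")]));

// Line 405 (shape assumed, values from the source): media understanding
// already accepts all three kinds this report asks model input to allow.
const MediaUnderstandingCapabilities = z.array(
  z.union([z.literal("image"), z.literal("audio"), z.literal("video")]),
);
```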
OpenClaw version
Source (main branch as of 2026-02-19, commit b228c06)
Operating system
Linux (TencentOS Server 4, kernel 6.6.98)
Install method
docker compose
Logs, screenshots, and evidence
No runtime error — the failure is at config validation time. The zod schema statically rejects the value before any code runs.
Relevant source locations:
| File | Line | Constraint |
|------|------|------------|
| `src/config/zod-schema.core.ts` | 41 | `z.array(z.union([z.literal("text"), z.literal("image")]))` |
| `src/config/zod-schema.core.ts` | 405 | `MediaUnderstandingCapabilitiesSchema` allows `["image", "audio", "video"]` |
| `src/config/types.models.ts` | 31 | `input: Array<"text" \| "image">` |
| `src/agents/model-catalog.ts` | 11, 20 | `input?: Array<"text" \| "image">` |
| `src/media/constants.ts` | 2-3 | `MAX_AUDIO_BYTES` (16MB), `MAX_VIDEO_BYTES` (16MB) already exist |
Impact and severity
- Affected: Users running Gemini models that support native video/audio input via `inlineData`
- Severity: Blocks config — cannot declare video/audio capabilities through the existing override path
- Frequency: 100% repro — deterministic zod validation failure
- Consequence: Users must rely on the lossy `media-understanding` transcription pipeline instead of native multimodal input; no workaround exists within the config system
Additional information
Affected locations beyond the zod schema
The "text" | "image" constraint also appears in:
src/agents/model-scan.ts:101 — parseModality() return type
src/agents/bedrock-discovery.ts:58,60 — Bedrock model mapping
src/agents/cloudflare-ai-gateway.ts:20 — Cloudflare model config
src/agents/huggingface-models.ts:202 — HuggingFace model parsing
src/commands/onboard-auth.config-litellm.ts:23 — LiteLLM onboarding
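Since the same union is repeated at each of these sites, one option is a single shared alias that they all import. A minimal sketch; the alias name and its home are suggestions, not existing code:

```ts
// Hypothetical shared alias (e.g. exported from src/config/types.models.ts);
// "ModelInputModality" does not exist in the codebase today.
export type ModelInputModality = "text" | "image" | "video" | "audio";

// Each affected site would then declare:
//   input: ModelInputModality[]
```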
Suggested fix
Widen the type union:

```ts
// Before:
input: Array<"text" | "image">

// After:
input: Array<"text" | "image" | "video" | "audio">
```

Fully backward compatible: existing configs without `"video"` / `"audio"` are unaffected.
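The matching zod change at `src/config/zod-schema.core.ts:41` widens the same union. A minimal sketch, assuming the schema keeps its current `z.array(z.union(...))` shape:

```ts
import { z } from "zod";

// Widened model-input schema, following the shape quoted in the table above.
const input = z.array(
  z.union([
    z.literal("text"),
    z.literal("image"),
    z.literal("video"), // new
    z.literal("audio"), // new
  ]),
);
```

(`z.enum(["text", "image", "video", "audio"])` would be an equivalent, terser form; the sketch keeps the existing union style to minimize the diff.)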
Note on pi-ai model catalog
The auto-generated model catalog (`models.generated.ts` from `@mariozechner/pi-ai`) also only emits `["text", "image"]` for Gemini models. Even if pi-ai updates its generation script, openclaw's own type constraint still needs to be widened independently, since they are separate TypeScript codebases.