Summary
Model input config (`models.providers[].models[].input`) rejects `"video"` and `"audio"` values at zod validation, preventing users from declaring Gemini native multimodal capabilities, even though `MAX_VIDEO_BYTES`, `MAX_AUDIO_BYTES`, and `MediaUnderstandingCapabilitiesSchema` in the same codebase already support them.
Steps to reproduce
- Add a model override in `openclaw.json`:

  ```json
  {
    "models": {
      "providers": {
        "google": {
          "models": [
            {
              "id": "gemini-2.5-flash",
              "input": ["text", "image", "video"]
            }
          ]
        }
      }
    }
  }
  ```
- Run `openclaw doctor` or start the agent.
- Config validation fails with a zod error:

  ```
  Invalid input at models.providers.google.models.0.input.2 (expected "text" or "image", got "video")
  ```
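The rejection can be reproduced in isolation. Below is a minimal sketch using the constraint quoted verbatim in the table under "Logs, screenshots, and evidence"; only the union comes from the source, and the variable name is illustrative:

```ts
import { z } from "zod";

// The exact constraint from src/config/zod-schema.core.ts:41;
// "InputSchema" is an illustrative name, not the one in the source.
const InputSchema = z.array(z.union([z.literal("text"), z.literal("image")]));

const result = InputSchema.safeParse(["text", "image", "video"]);
console.log(result.success); // false: "video" at index 2 matches neither literal
```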
Expected behavior
Config accepts `"video"` and `"audio"` in the `input` array, since the runtime already has infrastructure for these media types (`MAX_VIDEO_BYTES`, `MAX_AUDIO_BYTES`, `mediaKindFromMime()`, `maxBytesForKind()`).
Actual behavior
The zod schema at `src/config/zod-schema.core.ts:41` rejects any value other than `"text"` or `"image"`. Validation in `src/config/validation.ts:103` (`OpenClawSchema.safeParse()`) produces:

```
Invalid input at models.providers.google.models.0.input.2 — expected "text" or "image"
```

Notably, `MediaUnderstandingCapabilitiesSchema` in the same file (line 405) already accepts `["image", "audio", "video"]`.
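For contrast, the two constraints side by side; the model-input union is quoted from line 41, while the shape of `MediaUnderstandingCapabilitiesSchema` is an assumption reconstructed from the values it accepts:

```ts
import { z } from "zod";

// Line 41: model input rejects anything beyond "text" and "image".
const ModelInput = z.array(z.union([z.literal("text"), z.literal("image")]));

// Line 405 (shape assumed, values from the source): media understanding
// already accepts all three kinds this report asks model input to allow.
const MediaUnderstandingCapabilities = z.array(
  z.union([z.literal("image"), z.literal("audio"), z.literal("video")]),
);
```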
OpenClaw version
Source (main branch as of 2026-02-19, commit b228c06)
Operating system
Linux (TencentOS Server 4, kernel 6.6.98)
Install method
docker compose
Logs, screenshots, and evidence
No runtime error — the failure is at config validation time. The zod schema statically rejects the value before any code runs.
Relevant source locations:
| File | Line | Constraint |
|------|------|------------|
| `src/config/zod-schema.core.ts` | 41 | `z.array(z.union([z.literal("text"), z.literal("image")]))` |
| `src/config/zod-schema.core.ts` | 405 | `MediaUnderstandingCapabilitiesSchema` allows `["image", "audio", "video"]` |
| `src/config/types.models.ts` | 31 | `input: Array<"text" \| "image">` |
| `src/agents/model-catalog.ts` | 11, 20 | `input?: Array<"text" \| "image">` |
| `src/media/constants.ts` | 2-3 | `MAX_AUDIO_BYTES` (16MB), `MAX_VIDEO_BYTES` (16MB) already exist |
Impact and severity
- Affected: Users running Gemini models that support native video/audio input via `inlineData`
- Severity: Blocks config — cannot declare video/audio capabilities through the existing override path
- Frequency: 100% repro — deterministic zod validation failure
- Consequence: Users must rely on the lossy `media-understanding` transcription pipeline instead of native multimodal input; no workaround exists within the config system
Additional information
Affected locations beyond the zod schema
The "text" | "image" constraint also appears in:
src/agents/model-scan.ts:101 — parseModality() return type
src/agents/bedrock-discovery.ts:58,60 — Bedrock model mapping
src/agents/cloudflare-ai-gateway.ts:20 — Cloudflare model config
src/agents/huggingface-models.ts:202 — HuggingFace model parsing
src/commands/onboard-auth.config-litellm.ts:23 — LiteLLM onboarding
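Since the same union is repeated at each of these sites, one option is a single shared alias that they all import. A minimal sketch; the alias name and its home are suggestions, not existing code:

```ts
// Hypothetical shared alias (e.g. exported from src/config/types.models.ts);
// "ModelInputModality" does not exist in the codebase today.
export type ModelInputModality = "text" | "image" | "video" | "audio";

// Each affected site would then declare:
//   input: ModelInputModality[]
```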
Suggested fix
Widen the type union:

```ts
// Before:
input: Array<"text" | "image">

// After:
input: Array<"text" | "image" | "video" | "audio">
```

Fully backward compatible: existing configs without `"video"` / `"audio"` are unaffected.
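The matching zod change at `src/config/zod-schema.core.ts:41` widens the same union. A minimal sketch, assuming the schema keeps its current `z.array(z.union(...))` shape:

```ts
import { z } from "zod";

// Widened model-input schema, following the shape quoted in the table above.
const input = z.array(
  z.union([
    z.literal("text"),
    z.literal("image"),
    z.literal("video"), // new
    z.literal("audio"), // new
  ]),
);
```

(`z.enum(["text", "image", "video", "audio"])` would be an equivalent, terser form; the sketch keeps the existing union style to minimize the diff.)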
Note on pi-ai model catalog
The auto-generated model catalog (`models.generated.ts` from `@mariozechner/pi-ai`) also only emits `["text", "image"]` for Gemini models. Even if pi-ai updates its generation script, openclaw's own type constraint still needs to be widened independently, since they are separate TypeScript codebases.