[Bug]: models.input type union rejects "video" / "audio" — blocks Gemini native multimodal config #20721

@wu-tian807

Summary

Model input config (models.providers[].models[].input) rejects "video" and "audio" values at zod validation, preventing users from declaring Gemini native multimodal capabilities — even though MAX_VIDEO_BYTES, MAX_AUDIO_BYTES, and MediaUnderstandingCapabilitiesSchema in the same codebase already support them.

Steps to reproduce

  1. Add a model override in openclaw.json:
{
  "models": {
    "providers": {
      "google": {
        "models": [
          {
            "id": "gemini-2.5-flash",
            "input": ["text", "image", "video"]
          }
        ]
      }
    }
  }
}
  2. Run openclaw doctor or start the agent.
  3. Config validation fails with zod error: Invalid input at models.providers.google.models.0.input.2 (expected "text" or "image", got "video")

Expected behavior

Config accepts "video" and "audio" in the input array, since the runtime already has infrastructure for these media types (MAX_VIDEO_BYTES, MAX_AUDIO_BYTES, mediaKindFromMime(), maxBytesForKind()).

Actual behavior

The zod schema at src/config/zod-schema.core.ts:41 rejects any value other than "text" or "image". Validation in src/config/validation.ts:103 (OpenClawSchema.safeParse()) then produces:

Invalid input at models.providers.google.models.0.input.2
  — expected "text" or "image"

Notably, MediaUnderstandingCapabilitiesSchema in the same file (line 405) already accepts ["image", "audio", "video"].
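
The rejection can be illustrated without the real schema. The following is a minimal, dependency-free sketch, not the code in zod-schema.core.ts: `NARROW_KINDS`, `InputIssue`, and `validateInput` are hypothetical names that only mimic a literal-union check and the dotted path format of the reported error.

```typescript
// Hypothetical stand-in for the "text" | "image" union at zod-schema.core.ts:41.
const NARROW_KINDS: readonly string[] = ["text", "image"];

interface InputIssue {
  path: string;
  message: string;
}

// Checks each entry of a model's `input` array and reports the first
// entry that is not in `allowed`, using a dotted path like the zod error.
function validateInput(
  input: readonly string[],
  allowed: readonly string[],
  basePath: string,
): InputIssue | null {
  let found: InputIssue | null = null;
  input.forEach((kind, index) => {
    if (found === null && allowed.indexOf(kind) === -1) {
      const expected = allowed.map((k) => `"${k}"`).join(" or ");
      found = {
        path: `${basePath}.${index}`,
        message: `expected ${expected}, got "${kind}"`,
      };
    }
  });
  return found;
}

// Same input array as in the reproduction config above.
const issue = validateInput(
  ["text", "image", "video"],
  NARROW_KINDS,
  "models.providers.google.models.0.input",
);
```

Under these assumptions, `issue` points at index 2 of the array, which matches the path in the reported error.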

OpenClaw version

Source (main branch as of 2026-02-19, commit b228c06)

Operating system

Linux (TencentOS Server 4, kernel 6.6.98)

Install method

docker compose

Logs, screenshots, and evidence

No runtime error — the failure is at config validation time. The zod schema statically rejects the value before any code runs.

Relevant source locations:

| File | Line | Constraint |
|------|------|------------|
| `src/config/zod-schema.core.ts` | 41 | `z.array(z.union([z.literal("text"), z.literal("image")]))` |
| `src/config/zod-schema.core.ts` | 405 | `MediaUnderstandingCapabilitiesSchema` allows `["image", "audio", "video"]` |
| `src/config/types.models.ts` | 31 | `input: Array<"text" \| "image">` |
| `src/agents/model-catalog.ts` | 11, 20 | `input?: Array<"text" \| "image">` |
| `src/media/constants.ts` | 2-3 | `MAX_AUDIO_BYTES` (16MB), `MAX_VIDEO_BYTES` (16MB) already exist |

Impact and severity

  • Affected: Users running Gemini models that support native video/audio input via inlineData
  • Severity: Blocks config — cannot declare video/audio capabilities through the existing override path
  • Frequency: 100% repro — deterministic zod validation failure
  • Consequence: Users must rely on the lossy media-understanding transcription pipeline instead of native multimodal input; no workaround exists within the config system

Additional information

Affected locations beyond the zod schema

The "text" | "image" constraint also appears in:

  • src/agents/model-scan.ts:101 - parseModality() return type
  • src/agents/bedrock-discovery.ts:58,60 - Bedrock model mapping
  • src/agents/cloudflare-ai-gateway.ts:20 - Cloudflare model config
  • src/agents/huggingface-models.ts:202 - HuggingFace model parsing
  • src/commands/onboard-auth.config-litellm.ts:23 - LiteLLM onboarding
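
Because each of these locations maps provider-reported modalities onto the same union, one way to keep them consistent is a shared helper. The sketch below is an assumption about how that could look: `ModelInputKind` and `parseModalities` are hypothetical names, and this is not the actual parseModality() from model-scan.ts, whose signature is not shown in this report.

```typescript
// Hypothetical shared alias for the widened union.
type ModelInputKind = "text" | "image" | "video" | "audio";

// Hypothetical helper: normalizes provider-reported modality strings,
// drops values outside the union, and removes duplicates while
// preserving first-seen order.
function parseModalities(reported: readonly string[]): ModelInputKind[] {
  const kinds: ModelInputKind[] = [];
  reported.forEach((raw) => {
    const value = raw.trim().toLowerCase();
    switch (value) {
      case "text":
      case "image":
      case "video":
      case "audio":
        // `value` is narrowed to ModelInputKind inside these cases.
        if (kinds.indexOf(value) === -1) {
          kinds.push(value);
        }
        break;
      default:
        // Unknown modalities (for example "pdf") are ignored in this sketch.
        break;
    }
  });
  return kinds;
}

const parsed = parseModalities(["TEXT", "image", "Video", "audio", "pdf", "image"]);
```

Silently dropping unknown modalities is a design choice made for this sketch only; the real call sites may prefer to log or reject them.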

Suggested fix

Widen the type union:

// Before:
input: Array<"text" | "image">

// After:
input: Array<"text" | "image" | "video" | "audio">

Fully backward compatible — existing configs without "video" / "audio" are unaffected.
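
To keep the runtime check and the TypeScript type from drifting apart again, both could be derived from a single constant. This is a minimal sketch under that assumption, written without zod so that it is self-contained; `MODEL_INPUT_KINDS`, `isModelInputKind`, and `parseModelInput` are hypothetical names, not identifiers from the codebase.

```typescript
// Single source of truth for the widened union (hypothetical name).
const MODEL_INPUT_KINDS = ["text", "image", "video", "audio"] as const;
type ModelInputKind = (typeof MODEL_INPUT_KINDS)[number];

// Runtime guard that mirrors the compile-time union.
function isModelInputKind(value: unknown): value is ModelInputKind {
  return (
    typeof value === "string" &&
    (MODEL_INPUT_KINDS as readonly string[]).indexOf(value) !== -1
  );
}

// Returns the typed array, or throws naming the first offending index.
function parseModelInput(raw: readonly unknown[]): ModelInputKind[] {
  const result: ModelInputKind[] = [];
  raw.forEach((value, index) => {
    if (!isModelInputKind(value)) {
      throw new Error(`input.${index}: unsupported kind ${JSON.stringify(value)}`);
    }
    result.push(value);
  });
  return result;
}

// Existing configs keep working; the new kinds are accepted as well.
const legacy = parseModelInput(["text", "image"]);
const gemini = parseModelInput(["text", "image", "video", "audio"]);
```

In the actual schema, the same tuple could likely be passed to zod's `z.enum` in place of the two-literal union, so that the inferred type and the validator share one definition.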

Note on pi-ai model catalog

The auto-generated model catalog (models.generated.ts from @mariozechner/pi-ai) also only emits ["text", "image"] for Gemini models. Even if pi-ai updates its generation script, openclaw's own type constraint still needs to be widened independently — they are separate TypeScript codebases.

Labels

bug, stale