Allow nemoclaw onboard to override model input capability for multimodal models #3850

@zouchengli

Problem Statement

During nemoclaw onboard, models discovered from local or self-hosted providers may be registered as text-only, even when the underlying model supports multimodal input such as text+image.

For example, when using nvidia/nemotron-3-nano-omni-30b-a3b-reasoning, the model can be selected successfully during onboarding, but the generated OpenClaw model configuration may register it as text-only:

[Screenshot: model as registered after onboarding]

Screenshot after manual override showing the desired result:

[Screenshot: OpenClaw model listed as text+image]

Proposed Design

nemoclaw onboard should provide a supported way to override the input capability of a selected model when provider-side modality detection is incomplete.

A possible design:

  1. After the user selects a model during onboarding, show an input capability prompt:

     Selected model:
     nvidia/nemotron-3-nano-omni-30b-a3b-reasoning

     Input capability:
       1. Text only
       2. Text + Image
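
The prompt above could be handled roughly as follows. This is a minimal sketch, not the nemoclaw implementation: the function name `prompt_input_capability`, the `INPUT_CAPABILITIES` mapping, and the choice of text-only as the default answer are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from the menu choice to an input capability list.
INPUT_CAPABILITIES: Dict[str, List[str]] = {
    "1": ["text"],
    "2": ["text", "image"],
}


def prompt_input_capability(
    model_id: str,
    read_line: Callable[[str], str] = input,
    default: str = "1",
) -> List[str]:
    """Ask which inputs the selected model accepts.

    An empty answer keeps the default (text only, assumed here to be the
    safe choice). Unrecognized answers cause the prompt to repeat.
    """
    print(f"Selected model:\n{model_id}\n")
    print("Input capability:\n  1. Text only\n  2. Text + Image")
    while True:
        answer = read_line(f"Choice [{default}]: ").strip() or default
        if answer in INPUT_CAPABILITIES:
            # Return a copy so callers cannot mutate the shared mapping.
            return list(INPUT_CAPABILITIES[answer])
        print(f"Unrecognized choice: {answer!r}")
```

Passing `read_line` as a parameter keeps the sketch testable and would also let a non-interactive flag supply the answer instead of a terminal prompt.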

Alternatives Considered

  1. Always default all discovered models to text+image.

This would simplify onboarding, but it may be unsafe or misleading for pure text models. Some models or providers may reject image input, so a manual override is safer than changing the default for all models.

  2. Rely entirely on automatic provider metadata.

This is ideal when providers expose reliable modality metadata, but many OpenAI-compatible or self-hosted endpoints do not. In these cases, auto-discovery may only return the model id and context length, not whether image input is supported.

  3. Manually edit generated OpenClaw configuration files after onboarding.

This works as a workaround, but it is not a good user experience. Users need to know where the generated model catalog is located, which fields to modify, and when to restart the gateway. It is also easy to lose changes after recreating or re-onboarding a sandbox.

  4. Configure the image model after onboarding only.

Setting an image model after onboarding is not enough if the model catalog still declares the selected model as text-only. The model input capability itself also needs to be configurable.
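
Taken together, the alternatives suggest a precedence order: an explicit user override first, then provider metadata when it exists, then a text-only default. The sketch below illustrates that order under stated assumptions. The entry fields `id`, `input`, and `contextWindow`, and the helper names `resolve_input` and `build_catalog`, are hypothetical and not confirmed against the actual OpenClaw catalog schema or nemoclaw code.

```python
from typing import Dict, List


def resolve_input(discovered: Dict, overrides: Dict[str, List[str]]) -> List[str]:
    """Pick the input capability for one discovered model.

    Precedence: explicit user override, then provider metadata,
    then text-only as the conservative default.
    """
    model_id = discovered["id"]
    if model_id in overrides:
        return list(overrides[model_id])
    # Hypothetical metadata field; many endpoints return only id and context length.
    reported = discovered.get("input")
    if reported:
        return list(reported)
    return ["text"]


def build_catalog(
    discovered_models: List[Dict], overrides: Dict[str, List[str]]
) -> List[Dict]:
    """Return new catalog entries with a resolved 'input' field on each model."""
    return [{**m, "input": resolve_input(m, overrides)} for m in discovered_models]


# Example: the provider reports no modality, the user override supplies it.
discovered = [{"id": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"}]
overrides = {"nvidia/nemotron-3-nano-omni-30b-a3b-reasoning": ["text", "image"]}
catalog = build_catalog(discovered, overrides)
```

Keeping the overrides in their own mapping, separate from the generated catalog, means they could be reapplied after a sandbox is recreated or re-onboarded instead of being lost with the generated files.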

Category

enhancement: feature

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request

Labels

  • area: cli (command line interface, flags, terminal UX, or output)
  • area: inference (inference routing, serving, model selection, or outputs)
