Problem Statement
During nemoclaw onboard, models discovered from local or self-hosted providers may be registered as text-only, even when the underlying model supports multimodal input such as text+image.
For example, when using nvidia/nemotron-3-nano-omni-30b-a3b-reasoning, the model can be selected successfully during onboarding, but the generated OpenClaw model configuration may register it as:
Screenshot after manual override showing the desired result:

Proposed Design
nemoclaw onboard should provide a supported way to override the input capability of a selected model when provider-side modality detection is incomplete.
A possible design:
- During model selection, after the user selects a model, show an input capability prompt:
    Selected model:
    nvidia/nemotron-3-nano-omni-30b-a3b-reasoning
    Input capability:
    1. Text only
    2. Text + Image
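A minimal sketch of how such a prompt could be rendered and its answer parsed. All names here (`InputCapability`, `renderCapabilityPrompt`, `parseCapabilityAnswer`) are hypothetical and not taken from the NemoClaw codebase. An empty or unrecognized answer falls back to a caller-supplied default, so existing onboarding flows would keep their current behavior.

```typescript
// Hypothetical types and helpers for illustration only.
type InputCapability = "text" | "text+image";

interface CapabilityChoice {
  key: string; // what the user types at the prompt
  value: InputCapability;
  label: string;
}

const CAPABILITY_CHOICES: readonly CapabilityChoice[] = [
  { key: "1", value: "text", label: "Text only" },
  { key: "2", value: "text+image", label: "Text + Image" },
];

// Build the prompt text shown after a model has been selected.
function renderCapabilityPrompt(modelId: string): string {
  const options = CAPABILITY_CHOICES.map((c) => `${c.key}. ${c.label}`);
  return ["Selected model:", modelId, "Input capability:", ...options].join("\n");
}

// Map the raw answer to a capability; blank or unknown input keeps the fallback.
function parseCapabilityAnswer(
  answer: string,
  fallback: InputCapability,
): InputCapability {
  const trimmed = answer.trim();
  const match = CAPABILITY_CHOICES.find((c) => c.key === trimmed);
  return match ? match.value : fallback;
}
```

Passing the currently detected capability as `fallback` would let users press Enter to accept whatever auto-discovery reported.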
Alternatives Considered
- Always default all discovered models to text+image.
This would simplify onboarding, but it may be unsafe or misleading for pure text models. Some models or providers may reject image input, so a manual override is safer than changing the default for all models.
- Rely entirely on automatic provider metadata.
This is ideal when providers expose reliable modality metadata, but many OpenAI-compatible or self-hosted endpoints do not. In these cases, auto-discovery may only return the model id and context length, not whether image input is supported.
- Manually edit generated OpenClaw configuration files after onboarding.
This works as a workaround, but it is not a good user experience. Users need to know where the generated model catalog is located, which fields to modify, and when to restart the gateway. It is also easy to lose changes after recreating or re-onboarding a sandbox.
- Configure the image model after onboarding only.
Setting an image model after onboarding is not enough if the model catalog still declares the selected model as text-only. The model input capability itself also needs to be configurable.
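The trade-offs above suggest a precedence order: an explicit user override first, then provider metadata when it exists, then a conservative text-only default. The sketch below illustrates that order under stated assumptions. The `DiscoveredModel` and `CatalogEntry` shapes, the `input` field, and the `inputSource` marker are hypothetical and do not describe the actual OpenClaw catalog schema.

```typescript
// Hypothetical shapes for illustration; the real catalog schema may differ.
type InputModality = "text" | "image";

interface DiscoveredModel {
  id: string;
  contextLength?: number;
  // Often absent on OpenAI-compatible or self-hosted endpoints.
  inputModalities?: InputModality[];
}

interface CatalogEntry {
  id: string;
  input: InputModality[];
  // Records where the capability came from, which helps when debugging.
  inputSource: "override" | "provider" | "default";
}

function resolveCatalogEntry(
  model: DiscoveredModel,
  override?: InputModality[],
): CatalogEntry {
  // 1. An explicit choice made during onboarding always wins.
  if (override && override.length > 0) {
    return { id: model.id, input: [...override], inputSource: "override" };
  }
  // 2. Otherwise trust provider metadata when it is present.
  if (model.inputModalities && model.inputModalities.length > 0) {
    return { id: model.id, input: [...model.inputModalities], inputSource: "provider" };
  }
  // 3. Otherwise stay text-only rather than risk rejected image requests.
  return { id: model.id, input: ["text"], inputSource: "default" };
}
```

If the override were stored with the onboarding inputs rather than only in the generated catalog, it could be reapplied when a sandbox is recreated or re-onboarded, which is the main weakness of the manual-edit workaround.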
Category
enhancement: feature