What happens
Assign a model from an openai-compatible provider (llama.cpp, vLLM, llama-server, and friends) with netclaw model set, and the saved config comes back with "InputModalities": "Text" and "OutputModalities": "Text" — even when the model handles images or video. Once that value is on disk the daemon never corrects it. A config-level modality is treated as authoritative and short-circuits capability resolution at startup, so the vLLM/llama.cpp backend strategy, the OpenRouter oracle, and the HuggingFace resolver never get a vote.
The visible result is a multimodal model that quietly behaves as text-only. Image attachments get dropped before they ever reach the model, and netclaw daemon status reports input: Text.
Why
The OpenAI-style /v1/models listing has no modality field, so the parser never sets one. It builds each DiscoveredModel with an id and a context window and nothing else:
netclaw/src/Netclaw.Providers/ProbeHelpers.cs, lines 36 to 43 in 60601c6:
    if (model.TryGetProperty("id", out var id))
    {
        models.Add(new DiscoveredModel
        {
            ModelId = new(id.GetString()!),
            ContextWindowTokens = readContextWindow(model)
        });
    }
The openai-compatible provider routes straight through that helper.
netclaw/src/Netclaw.Providers/SelfHosted/OpenAiCompatibleDescriptor.cs, lines 50 to 51 in 60601c6:
    internal static ProviderProbeResult ParseModels(string json)
        => ProbeHelpers.ParseOpenAiStyleModels(json, TryReadContextWindow);
DiscoveredModel then falls back to its defaults, which are Text:
netclaw/src/Netclaw.Configuration/DiscoveredModel.cs, lines 31 to 34 in 60601c6:
    public ModelModality InputModalities { get; init; } = ModelModality.Text;

    /// <summary>Content types the model can produce as output.</summary>
    public ModelModality OutputModalities { get; init; } = ModelModality.Text;
So discovery hands back Text not because it detected a text-only model, but because it never looked. The persistence step then takes that default and writes it as if an operator had deliberately chosen it:
netclaw/src/Netclaw.Cli/Model/ModelCommand.cs, lines 183 to 186 in 60601c6:
    if (discoveredModel is not null)
    {
        modelEntry["InputModalities"] = discoveredModel.InputModalities.ToString();
        modelEntry["OutputModalities"] = discoveredModel.OutputModalities.ToString();
That is the trap. An unknown gets promoted to a hard assertion, and on the next daemon boot the override beats real detection.
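
The default promotion is easy to reproduce in isolation. Below is a minimal sketch, not the netclaw code: `SketchModel` and `SketchModality` are hypothetical stand-ins for `DiscoveredModel` and `ModelModality`, the model id is made up, and the payload only assumes the usual OpenAI-style listing shape with a `data` array of objects carrying an `id`.

```csharp
// Stand-in types for illustration only; these are NOT the real netclaw types.
using System;
using System.Collections.Generic;
using System.Text.Json;

// An OpenAI-style /v1/models payload: there is no modality field anywhere.
const string json = """
{ "object": "list", "data": [ { "id": "some-vision-model", "object": "model" } ] }
""";

var models = new List<SketchModel>();
using (var doc = JsonDocument.Parse(json))
{
    foreach (var model in doc.RootElement.GetProperty("data").EnumerateArray())
    {
        if (model.TryGetProperty("id", out var id))
        {
            // Only the id is set, so both modality properties keep their initializers.
            models.Add(new SketchModel { ModelId = id.GetString()! });
        }
    }
}

foreach (var m in models)
{
    // The "Text" printed here is a default, not something the server reported.
    Console.WriteLine($"{m.ModelId}: input={m.InputModalities}, output={m.OutputModalities}");
}

[Flags]
enum SketchModality { None = 0, Text = 1, Image = 2 }

sealed class SketchModel
{
    public required string ModelId { get; init; }
    public SketchModality InputModalities { get; init; } = SketchModality.Text;
    public SketchModality OutputModalities { get; init; } = SketchModality.Text;
}
```

This prints `some-vision-model: input=Text, output=Text`. Nothing downstream can tell that value apart from a genuine text-only detection.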
Same bug, second code path
The init wizard does the identical thing through its own code, so a fix in one place won't cover the other:
netclaw/src/Netclaw.Cli/Tui/Wizard/Steps/ProviderStepViewModel.cs, lines 319 to 329 in 60601c6:

        builder.Model = new ModelConfigSection
        {
            Provider = providerName,
            ModelId = SelectedModelId,
            ContextWindow = selectedModel?.ContextWindowTokens,
            Provenance = selectedModel is null ? ModelDiscoverySource.Manual : ModelDiscoverySource.Live,
            InputModalities = selectedModel?.InputModalities,
            OutputModalities = selectedModel?.OutputModalities,
        };
    }
Worth fixing both together, and ideally collapsing them onto one shared write path.
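
One possible shape for a shared write path is sketched below. Everything in it is an assumption for illustration: `ModalityWriter`, `SketchDiscovered`, and the string-typed modalities are hypothetical, and the real wizard writes a typed `ModelConfigSection` rather than a JSON node. The point is only that both callers go through a single helper, which writes a modality when discovery actually reported one and otherwise leaves the key out.

```csharp
// Hypothetical helper and types; a sketch, not the netclaw implementation.
using System;
using System.Text.Json.Nodes;

// Caller 1 (think: the CLI command). The listing reported nothing, so nothing is written.
var fromListing = new JsonObject { ["ModelId"] = "some-vision-model" };
ModalityWriter.Apply(fromListing, new SketchDiscovered("some-vision-model", null, null));
Console.WriteLine(fromListing.ToJsonString());

// Caller 2 (think: the wizard). A source that reports modalities gets them persisted.
var fromCatalog = new JsonObject { ["ModelId"] = "some-vision-model" };
ModalityWriter.Apply(fromCatalog, new SketchDiscovered("some-vision-model", "Text, Image", "Text"));
Console.WriteLine(fromCatalog.ToJsonString());

// Null means "the provider never told us", which is different from "Text".
record SketchDiscovered(string ModelId, string? InputModalities, string? OutputModalities);

static class ModalityWriter
{
    public static void Apply(JsonObject modelEntry, SketchDiscovered? discovered)
    {
        // Write each key only when a value was actually reported.
        if (discovered?.InputModalities is { } input)
        {
            modelEntry["InputModalities"] = input;
        }

        if (discovered?.OutputModalities is { } output)
        {
            modelEntry["OutputModalities"] = output;
        }
    }
}
```

With one helper, the rule for what counts as a known modality lives in a single place instead of being reimplemented per caller.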
Suggested direction
- When discovery can't actually determine modalities, leave them unset rather than defaulting to Text, and don't persist a value the provider never reported. Unknown should mean "let the daemon resolve this," not "Text, forever."
- Only write modalities to config when they come from a source that genuinely knows. The OpenAI Codex OAuth catalog, for example, already resolves real input_modalities/output_modalities; a self-hosted /v1/models listing does not.
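
To make "unset means the daemon resolves it" concrete, here is a minimal sketch of the resolution order under that rule. `CapabilityResolution`, `SketchModality`, and the lambda resolvers are hypothetical stand-ins for the real backend strategy, oracle, and resolver; the final fallback to Text is also an assumption, not a confirmed behaviour.

```csharp
// Hypothetical resolution order; a sketch, not the daemon's actual startup code.
using System;
using System.Collections.Generic;

// Stand-ins for real detection sources, tried in order until one returns an answer.
var resolvers = new List<Func<SketchModality?>>
{
    () => null,                                       // a probe that could not tell
    () => SketchModality.Text | SketchModality.Image  // a lookup that knows the model
};

// No modality in config: detection gets a vote.
Console.WriteLine(CapabilityResolution.Resolve(null, resolvers));

// An explicit modality in config still wins, as it does today.
Console.WriteLine(CapabilityResolution.Resolve(SketchModality.Text, resolvers));

[Flags]
enum SketchModality { None = 0, Text = 1, Image = 2 }

static class CapabilityResolution
{
    public static SketchModality Resolve(
        SketchModality? configured,
        IEnumerable<Func<SketchModality?>> resolvers)
    {
        // A value present in config is an operator override and short-circuits detection.
        if (configured is { } explicitValue)
        {
            return explicitValue;
        }

        // Otherwise take the first resolver that actually produces an answer.
        foreach (var resolver in resolvers)
        {
            if (resolver() is { } detected)
            {
                return detected;
            }
        }

        // Assumed conservative fallback when nothing could determine the modalities.
        return SketchModality.Text;
    }
}
```

The first call prints `Text, Image` and the second prints `Text`. The override semantics stay exactly as they are; the only change is that discovery stops manufacturing an override out of a default.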
Repro
- Configure an openai-compatible provider pointed at a server hosting a vision-capable model.
- Run netclaw model set Main <provider> <model-id>.
- Look at the saved model entry: InputModalities is Text.
- netclaw daemon status reports text-only input, and image attachments are dropped, regardless of the model's real capability.
Related
#1267 is about surfacing modalities in model discover/list/TUI. This is the upstream cause of the bad data it would surface: discovery defaults modalities to Text and then bakes them into config as an override.