openai-compatible discovery defaults modalities to Text, then the CLI persists it as a permanent override #1290

@Aaronontheweb

What happens

Assign a model from an openai-compatible provider (llama.cpp, vLLM, llama-server, and friends) with netclaw model set, and the saved config comes back with "InputModalities": "Text" and "OutputModalities": "Text" — even when the model handles images or video. Once that value is on disk the daemon never corrects it. A config-level modality is treated as authoritative and short-circuits capability resolution at startup, so the vLLM/llama.cpp backend strategy, the OpenRouter oracle, and the HuggingFace resolver never get a vote.

The visible result is a multimodal model that quietly behaves as text-only. Image attachments get dropped before they ever reach the model, and netclaw daemon status reports input: Text.
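To make the short-circuit concrete, here is a minimal sketch of the resolution order described above. Every name in it (CapabilityResolution, Resolve, the resolver delegates, the enum members, and the final Text fallback) is a hypothetical stand-in, not the daemon's actual code; the only behavior taken from this issue is that a config-level value wins before any detector runs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real enum; member names are assumed.
[Flags]
public enum ModelModality { None = 0, Text = 1, Image = 2, Video = 4 }

public static class CapabilityResolution
{
    // configOverride: the value read from saved config, or null when absent.
    // resolvers: stand-ins for the backend strategy, the OpenRouter oracle and
    // the HuggingFace resolver; each returns null when it cannot tell.
    public static ModelModality Resolve(
        ModelModality? configOverride,
        IEnumerable<Func<ModelModality?>> resolvers)
    {
        // A config-level value is authoritative: no resolver is consulted.
        if (configOverride is ModelModality pinned)
        {
            return pinned;
        }

        foreach (var resolver in resolvers)
        {
            if (resolver() is ModelModality detected)
            {
                return detected;
            }
        }

        // Assumed conservative fallback when nothing could be detected.
        return ModelModality.Text;
    }
}
```

With a persisted "Text" the first branch returns immediately, which is why a backend that could report image support is never asked.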

Why

The OpenAI-style /v1/models listing has no modality field, so the parser never sets one. It builds each DiscoveredModel with an id and a context window and nothing else.

DiscoveredModel then falls back to its defaults, which are Text:

    public ModelModality InputModalities { get; init; } = ModelModality.Text;
    /// <summary>Content types the model can produce as output.</summary>
    public ModelModality OutputModalities { get; init; } = ModelModality.Text;

So discovery hands back Text not because it detected a text-only model, but because it never looked. The persistence step then takes that default and writes it as if an operator had deliberately chosen it:

    if (discoveredModel is not null)
    {
        modelEntry["InputModalities"] = discoveredModel.InputModalities.ToString();
        modelEntry["OutputModalities"] = discoveredModel.OutputModalities.ToString();
        // ...
    }

That is the trap. An unknown gets promoted to a hard assertion, and on the next daemon boot the override beats real detection.
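The chain can be reproduced in isolation. The sketch below uses simplified, hypothetical stand-ins (the TrapDemo class, its method names, and the enum members are all assumptions); only the Text defaults and the ToString() write mirror the excerpts quoted above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real enum; member names are assumed.
[Flags]
public enum ModelModality { None = 0, Text = 1, Image = 2, Video = 4 }

public sealed record DiscoveredModel(string Id, int? ContextWindowTokens)
{
    // Same shape as the quoted defaults: "unknown" cannot be told apart from Text.
    public ModelModality InputModalities { get; init; } = ModelModality.Text;
    public ModelModality OutputModalities { get; init; } = ModelModality.Text;
}

public static class TrapDemo
{
    // Mimics a /v1/models parser that only knows the id and the context window.
    public static DiscoveredModel ParseListing(string id, int? contextWindow) =>
        new DiscoveredModel(id, contextWindow);

    // Mimics the persistence step: whatever discovery returned is written verbatim.
    public static Dictionary<string, string> BuildEntry(DiscoveredModel model) =>
        new Dictionary<string, string>
        {
            ["InputModalities"] = model.InputModalities.ToString(),
            ["OutputModalities"] = model.OutputModalities.ToString(),
        };
}
```

Passing any model id through ParseListing and then BuildEntry yields "Text" for both keys, although nothing in the listing ever said so.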

Same bug, second code path

The init wizard does the identical thing through its own code, so a fix in one place won't cover the other:

        builder.Model = new ModelConfigSection
        {
            Provider = providerName,
            ModelId = SelectedModelId,
            ContextWindow = selectedModel?.ContextWindowTokens,
            Provenance = selectedModel is null ? ModelDiscoverySource.Manual : ModelDiscoverySource.Live,
            InputModalities = selectedModel?.InputModalities,
            OutputModalities = selectedModel?.OutputModalities,
        };
    }

Worth fixing both together, and ideally collapsing them onto one shared write path.

Suggested direction

  • When discovery can't actually determine modalities, leave them unset rather than defaulting to Text, and don't persist a value the provider never reported. Unknown should mean "let the daemon resolve this," not "Text, forever."
  • Only write modalities to config when they come from a source that genuinely knows. The OpenAI Codex OAuth catalog, for example, already resolves real input_modalities/output_modalities; a self-hosted /v1/models listing does not.
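One possible shape for this, as a sketch rather than a proposed patch: make the discovered modalities nullable so "not reported" is representable, and route both callers through a single writer that skips unknown values. ModelEntryWriter, Build, the dictionary layout, and the enum members are hypothetical; the real types will differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real enum; member names are assumed.
[Flags]
public enum ModelModality { None = 0, Text = 1, Image = 2, Video = 4 }

public sealed record DiscoveredModel(string Id, int? ContextWindowTokens)
{
    // null means "the provider did not report this", not "text-only".
    public ModelModality? InputModalities { get; init; }
    public ModelModality? OutputModalities { get; init; }
}

public static class ModelEntryWriter
{
    // A single write path that both `model set` and the init wizard could share.
    public static Dictionary<string, object> Build(string provider, DiscoveredModel model)
    {
        var entry = new Dictionary<string, object>
        {
            ["Provider"] = provider,
            ["ModelId"] = model.Id,
        };

        if (model.ContextWindowTokens is int contextWindow)
        {
            entry["ContextWindow"] = contextWindow;
        }

        // Persist modalities only when discovery actually reported them.
        if (model.InputModalities is ModelModality input)
        {
            entry["InputModalities"] = input.ToString();
        }
        if (model.OutputModalities is ModelModality output)
        {
            entry["OutputModalities"] = output.ToString();
        }

        return entry;
    }
}
```

A model parsed from a bare /v1/models listing would then produce an entry with no modality keys at all, leaving the daemon free to resolve them at startup, while a catalog that reports real values still gets them written.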

Repro

  1. Configure an openai-compatible provider pointed at a server hosting a vision-capable model.
  2. netclaw model set Main <provider> <model-id>
  3. Look at the saved model entry: InputModalities is Text.
  4. netclaw daemon status reports text-only input, and image attachments are dropped, regardless of the model's real capability.

Related

#1267 is about surfacing modalities in model discover/list/TUI. This is the upstream cause of the bad data it would surface: discovery defaults modalities to Text and then bakes them into config as an override.


Labels: bug, config, providers
