bug: default model provider leaks from YAML into DataDesigner constructor #588

@nabinchha

Summary

DataDesigner.__init__ always reads the default: key from ~/.data-designer/model_providers.yaml and applies it to the runtime ModelProviderRegistry, even when the user supplies their own model_providers list. This causes two related problems:

  1. Hard failure — if the YAML's default names a provider that isn't in the user-supplied list, construction raises ValidationError: Specified default 'X' not found in providers list.
  2. Silent override — if the YAML's default happens to match a provider in the user-supplied list (but not the first one), the documented "first wins" behavior is silently overridden.

The bug is dormant on fresh installs (the seed YAML is written without a default: key), but it triggers immediately for anyone who uses dd config providers "Change default provider", hand-edits the YAML, or relies on a service/plugin that programmatically writes a default.

Repro 1: hard failure

import os, tempfile, yaml
from pathlib import Path

tmp_home = Path(tempfile.mkdtemp(prefix="dd_home_"))
os.environ["DATA_DESIGNER_HOME"] = str(tmp_home)
(tmp_home / "model_providers.yaml").write_text(yaml.safe_dump({
    "default": "nvidia",
    "providers": [{
        "name": "nvidia",
        "endpoint": "https://integrate.api.nvidia.com/v1",
        "provider_type": "openai",
        "api_key": "NVIDIA_API_KEY",
    }],
}))

from data_designer.config.models import ModelProvider
from data_designer.interface.data_designer import DataDesigner

custom_providers = [
    ModelProvider(name="my-vllm", endpoint="https://my-vllm.example.com/v1",
                  provider_type="openai", api_key="MY_VLLM_API_KEY"),
]
DataDesigner(model_providers=custom_providers)
# Raises:
#   ValidationError: 1 validation error for ModelProviderRegistry
#     Value error, Specified default 'nvidia' not found in providers list

Repro 2: silent override

Same setup but YAML has default: foo and user passes [bar, foo] (in that order). Expected default (per the "first wins" documented behavior): bar. Actual: foo.

Root cause

DataDesigner.__init__ passes get_default_provider_name() (which reads the YAML) unconditionally:

https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/packages/data-designer/src/data_designer/interface/data_designer.py#L153-L157

self._model_providers = self._resolve_model_providers(model_providers)
self._mcp_providers = mcp_providers or []
self._model_provider_registry = resolve_model_provider_registry(
    self._model_providers, get_default_provider_name()
)

get_default_provider_name() reads the YAML:

https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/packages/data-designer-config/src/data_designer/config/default_model_settings.py#L97-L98

The resolver then sets it as the registry's default, trusted over model_providers[0].name:

https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/packages/data-designer-engine/src/data_designer/engine/model_provider.py#L70-L78

And the registry's validator hard-rejects a default that isn't in the providers list:

https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/packages/data-designer-engine/src/data_designer/engine/model_provider.py#L47-L51
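
The chain above can be modeled with a small stand-in that uses no DataDesigner code. This is only a sketch under assumptions: Provider, Registry, resolve_registry, and current_init are hypothetical names that mimic the behavior described in this issue (the resolver trusts a passed default over the first provider, and the validator rejects a default missing from the list). The real classes raise ValidationError, not ValueError, and carry more fields.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical stand-in for ModelProvider; only the name matters here."""
    name: str


@dataclass
class Registry:
    """Hypothetical stand-in for ModelProviderRegistry."""
    providers: list[Provider]
    default: str

    def __post_init__(self) -> None:
        # Mimics the validator: the default must name a listed provider.
        if self.default not in {p.name for p in self.providers}:
            raise ValueError(
                f"Specified default {self.default!r} not found in providers list"
            )


def resolve_registry(providers: list[Provider], default_name: str | None) -> Registry:
    # Mimics the resolver: a supplied default is trusted over providers[0].name.
    return Registry(providers, default_name or providers[0].name)


def current_init(user_providers: list[Provider], yaml_default: str | None) -> Registry:
    # Mimics the current constructor: the YAML default is always passed through,
    # even though the providers came from the caller.
    return resolve_registry(user_providers, yaml_default)


# Repro 1: the YAML default names a provider the user did not pass.
try:
    current_init([Provider("my-vllm")], yaml_default="nvidia")
    repro1_error = None
except ValueError as exc:
    repro1_error = str(exc)

# Repro 2: the YAML default matches the second provider and displaces the first.
repro2_default = current_init(
    [Provider("bar"), Provider("foo")], yaml_default="foo"
).default
```

In this model, repro1_error holds the "not found in providers list" message and repro2_default is "foo" rather than "bar", matching the two symptoms reported above.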

Existing tests confirm the friction is felt

Two tests in packages/data-designer/tests/interface/test_data_designer.py already work around this by patching get_default_provider_name (lines 861-867 and 901-907). The stub_model_providers fixture has exactly one provider named stub-model-provider and the patch exists purely to prevent the YAML's default from leaking in. No test asserts the buggy behavior — a fix would let those patches drop away.

Suggested fix (minimal, non-breaking)

Only consult the YAML default when the user didn't supply their own providers:

if model_providers is None:
    self._model_providers = self._resolve_model_providers(None)
    default_name = get_default_provider_name()
else:
    self._model_providers = self._resolve_model_providers(model_providers)
    default_name = None  # User-supplied list owns its default (first wins)

self._mcp_providers = mcp_providers or []
self._model_provider_registry = resolve_model_provider_registry(
    self._model_providers, default_name
)

Closes both repros. Doesn't break anything — the two test patches become unnecessary but still pass. No public API change.
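
The intended behavior after the fix can be written down as a small, library-free check. This is a sketch, not the project's test code: pick_default is a hypothetical helper that works on plain provider names, and it assumes that _resolve_model_providers(None) falls back to the providers listed in the YAML.

```python
from __future__ import annotations


def pick_default(
    user_providers: list[str] | None,
    yaml_providers: list[str],
    yaml_default: str | None,
) -> str:
    """Return the provider name the registry would use as its default."""
    if user_providers is None:
        # No user list: the YAML owns both the providers and the default.
        providers, default_name = yaml_providers, yaml_default
    else:
        # User list supplied: the YAML default is ignored (first wins).
        providers, default_name = user_providers, None

    chosen = default_name or providers[0]
    if chosen not in providers:
        # Mirrors the registry validator described in this issue.
        raise ValueError(f"Specified default {chosen!r} not found in providers list")
    return chosen


# Repro 1 no longer fails: the user's only provider becomes the default.
fixed_repro1 = pick_default(["my-vllm"], ["nvidia"], "nvidia")

# Repro 2 no longer overrides: the first user-supplied provider wins.
fixed_repro2 = pick_default(["bar", "foo"], ["foo"], "foo")

# Without a user list, the YAML default is still honored.
yaml_only = pick_default(None, ["other", "nvidia"], "nvidia")
```

A real regression test would exercise DataDesigner directly with a temporary DATA_DESIGNER_HOME, as in Repro 1; the helper above only pins down the three expected outcomes.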

Severity

  • Fresh installs: dormant (seed YAML has no default: key).
  • CLI users who set a default: hit immediately.
  • Service/plugin scenarios: high impact. Anywhere a service writes a YAML default and a plugin then constructs DataDesigner with its own providers, the bug triggers.

Related

Architectural follow-ups are tracked separately:
