
Simplify schema_text() output for agent CLI introspection #418

@johnnygreco

Summary

Benchmark analysis of 30 agent runs shows the data-designer agent schema output lacks nested type details and enum values. Every run needing judge or code columns fell back to Python introspection (help(Score), print(list(CodeLang))), costing 2-3 extra tool calls each. The fix enriches ConfigBase.schema_text() so it renders complete, self-contained type information.

Changes

All changes center on ConfigBase.schema_text() (packages/data-designer-config/src/data_designer/config/base.py) and three small supporting helpers.

1. Skip discriminator and internal fields

_is_discriminator_field(field_info) returns True when the annotation is Literal[x] with exactly one arg that matches the field's default. Fields declared with repr=False (the Pydantic Field option that hides a field from the model's repr) are also skipped. Together these remove noise lines like column_type: Literal['llm-text'] = 'llm-text' and allow_resize: bool = False.
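The discriminator check can be sketched with the standard typing helpers alone. This is a minimal illustration, not the helper from base.py: the function name is_discriminator, the _MISSING sentinel, and the choice to pass the annotation and default as plain arguments (rather than a Pydantic field_info object) are all assumptions made to keep the example self-contained.

```python
from typing import Any, Literal, get_args, get_origin

# Hypothetical sentinel meaning "this field has no default".
_MISSING = object()


def is_discriminator(annotation: Any, default: Any = _MISSING) -> bool:
    """Return True for Literal[x] with exactly one arg equal to the default."""
    if get_origin(annotation) is not Literal:
        return False
    args = get_args(annotation)
    return len(args) == 1 and default is not _MISSING and args[0] == default


# A single-value Literal whose default repeats that value is pure noise.
print(is_discriminator(Literal["llm-text"], "llm-text"))
# A multi-value Literal is a real choice and stays visible.
print(is_discriminator(Literal["a", "b"], "a"))
# Ordinary fields are never treated as discriminators.
print(is_discriminator(bool, False))
```

In the actual helper the annotation and default would presumably be read from the field_info argument.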

2. Show enum values inline

After each field's description, if the annotation contains an Enum subclass, append a line listing its values, for example values: bash, c, python, ...
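One way to find an Enum inside an annotation is to walk its type arguments recursively. The sketch below assumes that approach; find_enum and enum_values_line are hypothetical names, and the CodeLang class here is a three-member stand-in rather than the real enum.

```python
import enum
from typing import Any, List, Optional, get_args, get_origin


class CodeLang(str, enum.Enum):
    """Stand-in enum with three members, for illustration only."""

    BASH = "bash"
    C = "c"
    PYTHON = "python"


def find_enum(annotation: Any) -> Optional[type]:
    """Return the first Enum subclass found in the annotation, if any."""
    if get_origin(annotation) is None:
        # Plain (non-generic) annotation: check it directly.
        if isinstance(annotation, type) and issubclass(annotation, enum.Enum):
            return annotation
        return None
    # Generic or union annotation: search its type arguments.
    for arg in get_args(annotation):
        found = find_enum(arg)
        if found is not None:
            return found
    return None


def enum_values_line(annotation: Any) -> Optional[str]:
    """Build the 'values: ...' line, or None when no Enum is involved."""
    enum_cls = find_enum(annotation)
    if enum_cls is None:
        return None
    return "values: " + ", ".join(str(member.value) for member in enum_cls)


print(enum_values_line(Optional[CodeLang]))
print(enum_values_line(List[CodeLang]))
print(enum_values_line(str))
```

Members are listed in definition order, which is the order Python yields when iterating an Enum class.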

3. Expand nested ConfigBase models

After each field's description, if the annotation contains a concrete ConfigBase subclass, indent and append its schema_text() output. Expansion is limited to one level deep via a _depth kwarg. Multi-member unions (discriminated unions like SamplerParamsT) are NOT expanded; agents use agent schema samplers <type> for those.

4. Append instantiation example

After all fields (depth 0 only), append an auto-generated example line:

Example: dd.LLMCodeColumnConfig(name=..., prompt=..., model_alias=..., code_lang=...)

Built from required non-discriminator fields with dd. prefix matching the standard import pattern.
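A rough sketch of how such an example line could be assembled follows. FieldSpec and example_line are hypothetical names, not part of data-designer; the real implementation would presumably read the same information from the Pydantic model's fields. The optional drop entry is included only to show a non-required field being left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldSpec:
    """Hypothetical summary of one model field (not a data-designer type)."""

    name: str
    required: bool
    is_discriminator: bool = False


def example_line(class_name: str, fields: List[FieldSpec], prefix: str = "dd.") -> str:
    """Render 'Example: dd.Class(a=..., b=...)' from required fields only."""
    names = [f.name for f in fields if f.required and not f.is_discriminator]
    args = ", ".join(f"{name}=..." for name in names)
    return f"Example: {prefix}{class_name}({args})"


fields = [
    FieldSpec("name", required=True),
    FieldSpec("prompt", required=True),
    FieldSpec("model_alias", required=True),
    FieldSpec("code_lang", required=True),
    FieldSpec("column_type", required=False, is_discriminator=True),
    FieldSpec("drop", required=False),
]
print(example_line("LLMCodeColumnConfig", fields))
```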

Supporting changes

  • SingleColumnConfig.allow_resize marked with repr=False so schema_text() skips it.
  • DropColumnsProcessorConfig docstring updated to note that most use cases should prefer drop=True on the column config directly.
  • Improved the error message that get_schema() raises when --all is combined with a type name, so that it suggests the correct usage.
  • Added warnings.filterwarnings("ignore", message=".*pyarrow.*") in CLI main.py to suppress PyArrow stderr noise.
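The effect of the warnings filter in the last bullet can be checked in isolation. This sketch uses only the standard warnings module; the two warning messages are invented for the demonstration and are not actual PyArrow output.

```python
import warnings


def emit_and_collect() -> list:
    """Emit two warnings and return the messages that survive the filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Same call as in the issue: the regex is matched case-insensitively
        # against the start of the message, so ".*pyarrow.*" hits anywhere.
        warnings.filterwarnings("ignore", message=".*pyarrow.*")
        warnings.warn("PyArrow will become a required dependency")  # invented text
        warnings.warn("unrelated deprecation")  # invented text
    return [str(w.message) for w in caught]


print(emit_and_collect())
```

Because filterwarnings inserts its rule at the front of the filter list, the ignore rule takes precedence over the earlier "always" rule, and only the unrelated warning is recorded.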

Expected output

data-designer agent schema columns llm-code (abbreviated):

LLMCodeColumnConfig:
  Configuration for code generation columns using Large Language Models.

  name: str  [required]
  prompt: str  [required]
      Jinja2 template for code generation prompt...
  model_alias: str  [required]
  code_lang: CodeLang  [required]
      Target programming language or SQL dialect...
      values: bash, c, cobol, cpp, csharp, go, java, javascript, ...

  Example: dd.LLMCodeColumnConfig(name=..., prompt=..., model_alias=..., code_lang=...)

data-designer agent schema columns llm-judge (showing nested expansion):

LLMJudgeColumnConfig:
  ...
  scores: list  [required]
      List of Score objects defining rubric criteria...
      Score:
        Configuration for a "score" in an LLM judge evaluation.

        name: str  [required]
        description: str  [required]
        options: dict  [required]

  Example: dd.LLMJudgeColumnConfig(name=..., prompt=..., model_alias=..., scores=...)

Note: column_type and allow_resize lines are gone from both outputs.

Verification

.venv/bin/pytest packages/data-designer-config/tests/config/test_schema_text.py -v
.venv/bin/pytest packages/data-designer/tests/cli/ -v
make check-all

Labels: enhancement
