Summary
Benchmark analysis of 30 agent runs shows the data-designer agent schema output lacks nested type details and enum values. Every run needing judge or code columns fell back to Python introspection (help(Score), print(list(CodeLang))), costing 2-3 extra tool calls each. The fix enriches ConfigBase.schema_text() so it renders complete, self-contained type information.
Changes
All changes center on ConfigBase.schema_text() (packages/data-designer-config/src/data_designer/config/base.py) and three small supporting helpers.
1. Skip discriminator and internal fields
_is_discriminator_field(field_info) returns True when the annotation is Literal[x] with exactly one arg that matches the field's default. Fields declared with repr=False (Pydantic's flag for excluding a field from the model repr) are also skipped. This removes noise lines like column_type: Literal['llm-text'] = 'llm-text' and allow_resize: bool = False.
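The discriminator check can be sketched with the standard library alone. This is an illustration under assumptions, not the code in base.py: the real helper takes a Pydantic field_info, while the hypothetical is_discriminator below takes the annotation and default directly so it runs without Pydantic installed.

```python
from typing import Any, Literal, get_args, get_origin


def is_discriminator(annotation: Any, default: Any) -> bool:
    """Hypothetical stand-in for _is_discriminator_field.

    True only for Literal[x] with a single argument equal to the default.
    """
    if get_origin(annotation) is not Literal:
        return False
    args = get_args(annotation)
    return len(args) == 1 and args[0] == default


# A single-value Literal whose default matches is treated as a discriminator.
print(is_discriminator(Literal["llm-text"], "llm-text"))
# A multi-value Literal is a real choice for the user, so it is kept.
print(is_discriminator(Literal["a", "b"], "a"))
```

Requiring the default to match keeps a single-value Literal with no default (one the caller must still pass) visible in the schema.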
2. Show enum values inline
After each field's description, if the annotation contains an Enum subclass, append values: bash, c, python, ....
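One way the enum lookup might work is a recursive walk over the annotation's type arguments. The sketch below is an assumption about the approach, not the actual implementation; find_enum, enum_values_line, and DemoLang are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Any, Optional, get_args, get_origin


def find_enum(annotation: Any) -> Optional[type]:
    """Return the first Enum subclass found inside an annotation, if any."""
    if get_origin(annotation) is None and isinstance(annotation, type):
        return annotation if issubclass(annotation, Enum) else None
    for arg in get_args(annotation):
        found = find_enum(arg)
        if found is not None:
            return found
    return None


def enum_values_line(annotation: Any) -> Optional[str]:
    """Render the 'values: ...' line, or None when no enum is involved."""
    enum_cls = find_enum(annotation)
    if enum_cls is None:
        return None
    return "values: " + ", ".join(str(member.value) for member in enum_cls)


class DemoLang(str, Enum):  # hypothetical enum, standing in for CodeLang
    BASH = "bash"
    C = "c"
    PYTHON = "python"


print(enum_values_line(Optional[DemoLang]))
```

Walking get_args means wrappers such as Optional[...] do not hide the enum from the renderer.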
3. Expand nested ConfigBase models
After each field's description, if the annotation contains a concrete ConfigBase subclass, indent and append its schema_text() output. Expansion is limited to 1 level deep via a _depth kwarg. Multi-member unions (discriminated unions like SamplerParamsT) are NOT expanded; agents use agent schema samplers <type> for those.
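The depth limit and the union rule can be illustrated with a small stand-in. SketchBase, nested_models, and the Leaf/Inner/Outer classes are hypothetical; the real ConfigBase is a Pydantic model and renders much more per field, so treat this only as a sketch of the control flow.

```python
import textwrap
from typing import Any, List, Union, get_args, get_origin, get_type_hints


class SketchBase:
    """Hypothetical stand-in for ConfigBase; not the real class."""

    @classmethod
    def schema_text(cls, _depth: int = 0) -> str:
        lines = [f"{cls.__name__}:"]
        for name, annotation in get_type_hints(cls).items():
            lines.append(f"  {name}")
            nested = nested_models(annotation)
            # Expand only a single concrete model, and only from the top level.
            if len(nested) == 1 and _depth < 1:
                child = nested[0].schema_text(_depth=_depth + 1)
                lines.append(textwrap.indent(child, "    "))
        return "\n".join(lines)


def nested_models(annotation: Any) -> list:
    """Collect every SketchBase subclass mentioned in an annotation."""
    if get_origin(annotation) is None and isinstance(annotation, type):
        return [annotation] if issubclass(annotation, SketchBase) else []
    found = []
    for arg in get_args(annotation):
        found.extend(nested_models(arg))
    return found


class Leaf(SketchBase):
    value: int


class Inner(SketchBase):
    leaf: Leaf


class Outer(SketchBase):
    items: List[Inner]          # one concrete model: expanded once
    either: Union[Inner, Leaf]  # multi-member union: left unexpanded


print(Outer.schema_text())
```

Here Inner is expanded under items, but its own leaf field is not expanded further because the nested call runs at depth 1.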
4. Append instantiation example
After all fields (depth 0 only), append an auto-generated example line:
Example: dd.LLMCodeColumnConfig(name=..., prompt=..., model_alias=..., code_lang=...)
Built from required non-discriminator fields with dd. prefix matching the standard import pattern.
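The example line itself is simple string assembly. A minimal sketch, assuming the list of required non-discriminator field names has already been computed; example_line is a hypothetical helper name, not one taken from the PR.

```python
from typing import List


def example_line(class_name: str, required_fields: List[str], prefix: str = "dd") -> str:
    """Build the trailing 'Example: ...' line from required field names."""
    args = ", ".join(f"{name}=..." for name in required_fields)
    return f"Example: {prefix}.{class_name}({args})"


print(example_line("LLMCodeColumnConfig", ["name", "prompt", "model_alias", "code_lang"]))
```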
Supporting changes
SingleColumnConfig.allow_resize marked with repr=False so schema_text() skips it.
DropColumnsProcessorConfig docstring updated to note that most use cases should prefer drop=True on the column config directly.
Improved the --all + type name error message in get_schema() to suggest correct usage.
Added warnings.filterwarnings("ignore", message=".*pyarrow.*") in CLI main.py to suppress PyArrow stderr noise.
Expected output
data-designer agent schema columns llm-code (abbreviated):
LLMCodeColumnConfig:
Configuration for code generation columns using Large Language Models.
name: str [required]
prompt: str [required]
Jinja2 template for code generation prompt...
model_alias: str [required]
code_lang: CodeLang [required]
Target programming language or SQL dialect...
values: bash, c, cobol, cpp, csharp, go, java, javascript, ...
Example: dd.LLMCodeColumnConfig(name=..., prompt=..., model_alias=..., code_lang=...)
data-designer agent schema columns llm-judge (showing nested expansion):
LLMJudgeColumnConfig:
...
scores: list [required]
List of Score objects defining rubric criteria...
Score:
Configuration for a "score" in an LLM judge evaluation.
name: str [required]
description: str [required]
options: dict [required]
Example: dd.LLMJudgeColumnConfig(name=..., prompt=..., model_alias=..., scores=...)
Note: column_type and allow_resize lines are gone from both outputs.
Verification
.venv/bin/pytest packages/data-designer-config/tests/config/test_schema_text.py -v
.venv/bin/pytest packages/data-designer/tests/cli/ -v
make check-all