Skip to content

refactor: simplify agent CLI to context, types, and state (#418)#420

Merged
johnnygreco merged 20 commits into
mainfrom
johnny/418/agent-cli-v2
Mar 17, 2026
Merged

refactor: simplify agent CLI to context, types, and state (#418)#420
johnnygreco merged 20 commits into
mainfrom
johnny/418/agent-cli-v2

Conversation

@johnnygreco

@johnnygreco johnnygreco commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Simplifies the agent CLI by removing schema and builder rendering in favor of providing source file paths. The agent can read the config files directly when it needs schema details, which is more flexible and eliminates a complex rendering layer (-430 lines net).

Three subcommands remain: context (bootstrap), types (catalog with descriptions and file paths), and state (model aliases, persona datasets).

Changes

Config layer cleanup

  • Remove ConfigBase.schema_text() and all supporting helpers (_format_annotation, _get_docstring_summary, _GOOGLE_SECTION_HEADERS) from base.py — schema rendering no longer lives in the config layer
  • Delete test_schema_text.py (204 lines)
  • Add docstrings to ScalarInequalityConstraint, ColumnInequalityConstraint
  • Update DropColumnsProcessorConfig docstring to surface drop=True preference

Agent CLI simplification

  • Remove schema and builder subcommands and all supporting code
  • Add "Config Module" section to context output with root (absolute path to config module), builder, base, and import — explicitly tells the agent this is the only part of the codebase to work with
  • Add description (docstring first paragraph) and file (relative to config root) to the types table
  • Family file paths are relative to config module root for direct resolution
  • Clean section hierarchy: ## for sections, ### for family sub-tables

Robustness fixes (from Greptile review)

  • _get_source_file uses last occurrence of data_designer in path parts (handles nested dev checkouts)
  • _get_source_file returns empty string (not absolute path) when data_designer is absent
  • get_family_source_files returns deduplicated list (handles types spanning multiple files)
  • config_builder_file resolved dynamically via _get_source_file, not hardcoded
  • Added parameters: and params: to docstring section headers

Example output

Data Designer v0.5.4

## Config Module

root: /path/to/site-packages/data_designer/config
This is the only part of the codebase agents need to work with. All files below are relative to root.

builder: config_builder.py
base: base.py (read for inherited fields shared by columns and processors)
import: import data_designer.config as dd

## Types

### columns
file: column_configs.py

type_name       description
--------------  -------------------------------------------------------
llm-text        Configuration for text generation columns using LLMs.
llm-judge       Configuration for LLM-as-a-judge quality assessment.
...

Test plan

  • 957 tests pass (config + CLI)
  • Verified all 4 subcommands produce correct output
  • Verified removed commands (schema, builder) return "No such command"

@johnnygreco johnnygreco requested a review from a team as a code owner March 16, 2026 01:49
@greptile-apps

greptile-apps Bot commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR simplifies the agent CLI from six subcommands to four (context, types, state.model-aliases, state.persona-datasets) by replacing the schema and builder subcommands with source file path references. Instead of rendering schemas inline, the agent is now told exactly which config files to read directly. The net result is -430 lines and a cleaner, more flexible interface. Config docstrings across the package are improved with (required) annotations and Inherited Attributes: sections to better serve _get_first_paragraph extraction.

Key changes:

  • ConfigBase.schema_text() and all supporting helpers removed from base.py; equivalent (improved) logic re-implemented in agent_introspection.py as _get_first_paragraph + _SECTION_HEADERS
  • New get_config_module_path(), get_family_source_files(), and _get_source_file() power path-based context
  • format_context_text now includes a ## Config Module section with config_root, builder path, and base path, so agents can directly open source files
  • _format_model_aliases_context replaces the old inline aliases summary, correctly showing per-row reasons and a clear guidance message when no usable aliases exist
  • All previously reviewed fixes (last-occurrence data_designer index, .as_posix(), deduplicated family files, dynamic builder file resolution, _SECTION_HEADERS additions) are present

One remaining issue: _get_source_file still returns full_path.as_posix() (an absolute path) when data_designer is absent from the path, despite the PR description and previous review reply both stating it was fixed to return an empty string. When such an absolute path passes through _format_family_header, it produces a malformed file: {config_root}//absolute/path entry. This only affects plugin types registered outside the data_designer package, but the discrepancy between the stated fix and the implementation is worth resolving before merge.

Confidence Score: 4/5

  • Safe to merge for all production workloads; one edge-case rendering bug only affects hypothetical plugin types outside the data_designer package.
  • The refactor is well-scoped, all 957 tests pass, and every substantive issue from the previous review round has been addressed. The single remaining problem — _get_source_file returning an absolute path instead of empty string for non-data_designer types — only manifests for plugin types that live entirely outside the package (none currently exist), and it produces a malformed {config_root}//... agent output line rather than a crash. The fix is a one-liner. All other logic, test coverage, path handling, and docstring improvements are correct.
  • packages/data-designer/src/data_designer/cli/utils/agent_introspection.py — specifically the _get_source_file fallback at line 300-301.

Important Files Changed

Filename Overview
packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Major refactor removing schema/builder APIs and adding path-based introspection (_get_source_file, get_family_source_files, get_config_module_path). All previous review fixes (last-occurrence index, posix paths, _SECTION_HEADERS additions, dynamic builder file resolution) are correctly implemented. One remaining issue: _get_source_file still returns an absolute path (not empty string) when data_designer is absent from the path, contradicting the PR description and producing malformed {config_root}//absolute/path output in _format_family_header.
packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py Clean rewrite replacing schema/builder formatters with path-based family headers and _format_model_aliases_context. {config_root} template pattern is used consistently. _strip_config_prefix correctly strips data_designer/config/ for built-in types; the acknowledged passthrough for other paths is only problematic when _get_source_file returns an absolute path (see introspection comment). _format_model_aliases_context gracefully handles the no-usable-aliases and empty-items cases.
packages/data-designer/src/data_designer/cli/commands/agent.py Straightforward removal of schema_command and builder_command and their imports. Remaining commands (context, types, state model-aliases, state persona-datasets) are unchanged in structure.
packages/data-designer-config/src/data_designer/config/base.py Removes schema_text(), _format_annotation, _get_docstring_summary, _GOOGLE_SECTION_HEADERS, and now-unused re, Enum, Literal, get_args, get_origin imports. Clean deletion with no regressions; the same docstring-parsing logic has been re-implemented (with improvements) in agent_introspection.py.
packages/data-designer-config/src/data_designer/config/column_configs.py Docstring updates only: adds (required) annotations on mandatory fields and Inherited Attributes: sections, and removes bare name: str redeclaration from ExpressionColumnConfig. No behavioural changes.
packages/data-designer-config/src/data_designer/config/processors.py Docstring improvements for DropColumnsProcessorConfig and SchemaTransformProcessorConfig. Bare name: str redeclarations were removed (resolving the previously flagged issue where the parent Field(description=...) metadata was being silently dropped).
packages/data-designer/tests/cli/utils/test_agent_text_formatter.py Tests updated to match the new formatter output: removed schema/builder tests, added test_format_context_text_includes_config_module_path, test_format_context_text_no_usable_aliases_shows_warning, and updated fixture paths to use realistic data_designer/config/... paths that exercise _strip_config_prefix.
packages/data-designer/tests/cli/utils/test_agent_introspection.py Tests updated: removed schema/builder tests, added test_get_family_catalog_includes_description, test_get_family_source_files_returns_relative_paths, and test_get_config_module_path_returns_config_directory. Assertions are appropriate and not fragile.
packages/data-designer-config/src/data_designer/config/sampler_constraints.py Added missing docstrings to ScalarInequalityConstraint and ColumnInequalityConstraint, including Inherited Attributes: sections. No code changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    CLI["data-designer agent <cmd>"] --> CTX["context"]
    CLI --> TYPES["types [family]"]
    CLI --> SMA["state model-aliases"]
    CLI --> SPD["state persona-datasets"]

    CTX --> GC["get_context()"]
    GC --> GCM["get_config_module_path()\n→ .../data_designer/config"]
    GC --> GCB["_get_config_builder_file()\n→ _get_source_file(DataDesignerConfigBuilder)"]
    GC --> GBF["_get_base_config_file()\n→ _get_source_file(ConfigBase)"]
    GC --> GFC["get_family_catalog(family)\n→ [{type: cls.__name__, description: ...}]"]
    GC --> GFS["get_family_source_files(family)\n→ [data_designer/config/foo.py, ...]"]

    TYPES --> GT["get_types(family?)"]
    GT --> GFC
    GT --> GFS

    GFS --> GSF["_get_source_file(cls)\n→ relative path or abs path"]
    GSF -->|data_designer in path| REL["data_designer/config/foo.py"]
    GSF -->|data_designer NOT in path| ABS["full_path.as_posix() ⚠️"]

    GCM --> FCT["format_context_text(data)\n## Config Module section"]
    GFC --> FCT
    GFS --> FCT
    FCT --> FFH["_format_family_header(info)\nfile: {config_root}/stripped_path"]
    FFH --> SCP["_strip_config_prefix(path)\nstrips data_designer/config/"]
    SCP -->|normal| OK["file: {config_root}/foo.py ✓"]
    SCP -->|absolute path| BAD["file: {config_root}//abs/path ✗"]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer/src/data_designer/cli/utils/agent_introspection.py
Line: 300-302

Comment:
**Absolute path fallback produces malformed `{config_root}/` output**

The PR description explicitly states `_get_source_file` was fixed to "return empty string (not absolute path) when `data_designer` is absent", and the previous review reply confirmed the same ("Fixed — returns empty string now"). However, the implementation still returns `full_path.as_posix()` for plugin types whose path contains no `data_designer` component.

When this absolute path (e.g. `/home/user/my_plugin.py`) propagates through `get_family_source_files` — which only guards against `if path` (non-empty) — it reaches `_format_family_header`, which renders:

```
file: {config_root}//home/user/my_plugin.py
```

The double slash makes this a malformed agent-facing path that cannot be resolved by substituting the `{config_root}` template variable. The docstring now documents "absolute path" intent, but the rendering layer in `_format_family_header` (line 85 of `agent_text_formatter.py`) unconditionally prepends `{config_root}/` and only strips the `data_designer/config/` prefix — it has no fallback for absolute paths.

Returning empty string here (consistent with the `TypeError`/`OSError` branch above) would let `get_family_source_files`' `if path and` guard silently drop the entry and avoid the malformed output:

```suggestion
    if not indices:
        return ""
    return Path(*parts[indices[-1] :]).as_posix()
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 162322a

Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer/tests/cli/utils/test_agent_introspection.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer/tests/cli/utils/test_agent_introspection.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer-config/src/data_designer/config/processors.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py Outdated
Comment thread packages/data-designer/tests/cli/utils/test_agent_text_formatter.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py Outdated
Comment thread packages/data-designer/tests/cli/utils/test_agent_text_formatter.py
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_introspection.py Outdated
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py Outdated
nabinchha
nabinchha previously approved these changes Mar 16, 2026
Comment thread packages/data-designer/src/data_designer/cli/utils/agent_text_formatter.py Outdated
andreatgretel
andreatgretel previously approved these changes Mar 16, 2026
- Remove schema and builder subcommands and all supporting code
- Add description column (docstring first paragraph) to types table
- Add config_file per family (relative to data_designer package)
- Add config_package_path and library_version to context output
- Clean section hierarchy: ## for sections, ### for family sub-tables
- Add docstrings to ScalarInequalityConstraint and ColumnInequalityConstraint
- Remove unused get_import_path (only used by deleted schema/builder)
- Remove unused class_name from catalog dicts
- Fix N+1: get_family_source_file uses get_args directly instead of
  rediscovering all types via discover_family_types
- Add parameters:/params: to _SECTION_HEADERS for docstring parsing
- Fix config_package_path to return parent of data_designer package so
  Path(base) / relative_file resolves correctly
- Use last occurrence of data_designer in _get_source_file to handle
  nested paths (e.g. dev checkouts)
- Return list of deduplicated files per family (get_family_source_files)
  instead of assuming all types live in one file
- Add config_builder_file to context output
- Use _get_source_file(DataDesignerConfigBuilder) instead of hardcoded
  string for config_builder_file, consistent with family file resolution
- Fix test assertion that assumed "config" in path (only true in dev)
- _get_source_file returns "" instead of absolute path when
  data_designer is not in the path, consistent with error branch
- Add Config Module section to context output pointing agent to
  the config module as the only part of the codebase to work with
- Rename config_package_path to config_module_path (returns config dir)
Schema rendering is no longer needed in the config layer — the agent
CLI now provides file paths so agents can read source files directly.
- Redeclare `name: str` in DropColumnsProcessorConfig and
  SchemaTransformProcessorConfig so agents see the required field
  without reading the base class
- Add base config file path to agent context output
- Optimize agent context formatting: strip redundant path prefixes,
  remove family count summary, separate usable/unusable model aliases,
  rename sections for clarity
- Remove bare name: str redeclarations in processor configs that
  silently dropped the parent Field(description=...)
- Use Path.as_posix() in _get_source_file for consistent forward slashes
…ited Attributes

- Add (required) to all required parameters in Attributes sections
- Add Inherited Attributes section to all config subclasses listing
  fields from parent classes (SingleColumnConfig, ProcessorConfig, Constraint)
- Fix stale with_trace descriptions in LLM subclass inherited sections
- Remove discriminator fields from Attributes sections
- Remove redundant name: str redeclaration from ExpressionColumnConfig
- Show per-alias reason for unusable models instead of blanket
  "missing API keys" label
- Surface model_config_present: tell agent when no config file exists
- Fix test fixtures to use realistic data_designer/config/ paths that
  exercise _strip_config_prefix
Move `name (required)` to the top of the Inherited Attributes section
in LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig
so required fields appear before optional ones.
- Use {config_root}/file.py path syntax across all agent output
- Add config_root preamble to standalone `agent types` output
- Replace type_name (discriminator) with type (class name) in tables
- Show only usable model aliases; warn agent to surface config issues
- Add directive scoping agents to the config module only
- Reword import hint and config module description for directness
_get_source_file() returned "" for types outside the data_designer
package (e.g., plugin configs). Now returns the absolute path so
the agent still gets a readable file reference.
main() calls ensure_cli_default_model_settings() before any agent
command, so model config is always seeded. The model_config_present=False
branch was dead code.
Covers the remaining branch in _format_model_aliases_context where
all aliases are unusable and the agent gets a warning to surface to
the user.
Address two Greptile review comments:
- Add "inherited attributes:" to _SECTION_HEADERS so docstring parsing
  stops before that section even without a preceding blank line.
- Use .as_posix() in get_config_module_path() for consistent
  forward-slash paths across platforms.
@johnnygreco johnnygreco dismissed stale reviews from andreatgretel and nabinchha via 84e9a2c March 17, 2026 03:52
@johnnygreco johnnygreco force-pushed the johnny/418/agent-cli-v2 branch from d6659fd to 84e9a2c Compare March 17, 2026 03:52
@johnnygreco johnnygreco merged commit 164db0a into main Mar 17, 2026
47 checks passed
Comment on lines +300 to +302
if not indices:
return full_path.as_posix()
return Path(*parts[indices[-1] :]).as_posix()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Absolute path fallback produces malformed {config_root}/ output

The PR description explicitly states _get_source_file was fixed to "return empty string (not absolute path) when data_designer is absent", and the previous review reply confirmed the same ("Fixed — returns empty string now"). However, the implementation still returns full_path.as_posix() for plugin types whose path contains no data_designer component.

When this absolute path (e.g. /home/user/my_plugin.py) propagates through get_family_source_files — which only guards against if path (non-empty) — it reaches _format_family_header, which renders:

file: {config_root}//home/user/my_plugin.py

The double slash makes this a malformed agent-facing path that cannot be resolved by substituting the {config_root} template variable. The docstring now documents "absolute path" intent, but the rendering layer in _format_family_header (line 85 of agent_text_formatter.py) unconditionally prepends {config_root}/ and only strips the data_designer/config/ prefix — it has no fallback for absolute paths.

Returning empty string here (consistent with the TypeError/OSError branch above) would let get_family_source_files' if path and guard silently drop the entry and avoid the malformed output:

Suggested change
if not indices:
return full_path.as_posix()
return Path(*parts[indices[-1] :]).as_posix()
if not indices:
return ""
return Path(*parts[indices[-1] :]).as_posix()
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer/src/data_designer/cli/utils/agent_introspection.py
Line: 300-302

Comment:
**Absolute path fallback produces malformed `{config_root}/` output**

The PR description explicitly states `_get_source_file` was fixed to "return empty string (not absolute path) when `data_designer` is absent", and the previous review reply confirmed the same ("Fixed — returns empty string now"). However, the implementation still returns `full_path.as_posix()` for plugin types whose path contains no `data_designer` component.

When this absolute path (e.g. `/home/user/my_plugin.py`) propagates through `get_family_source_files` — which only guards against `if path` (non-empty) — it reaches `_format_family_header`, which renders:

```
file: {config_root}//home/user/my_plugin.py
```

The double slash makes this a malformed agent-facing path that cannot be resolved by substituting the `{config_root}` template variable. The docstring now documents "absolute path" intent, but the rendering layer in `_format_family_header` (line 85 of `agent_text_formatter.py`) unconditionally prepends `{config_root}/` and only strips the `data_designer/config/` prefix — it has no fallback for absolute paths.

Returning empty string here (consistent with the `TypeError`/`OSError` branch above) would let `get_family_source_files`' `if path and` guard silently drop the entry and avoid the malformed output:

```suggestion
    if not indices:
        return ""
    return Path(*parts[indices[-1] :]).as_posix()
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants