fix: processor artifacts type, discovery, and loading by andreatgretel · Pull Request #366 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-03-03T16:27:48Z

Summary

Fix PreviewResults.processor_artifacts type annotation from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual output of df.to_dict(orient="records")
Add list_processor_names() and load_processor_dataset() as standalone functions in data_designer.config.utils.io_helpers - they take a Path and handle both directory-based (batched) and single-file (preview) storage layouts
Fix preview bug: discovery now uses disk contents instead of iterating get_processor_configs(), which would crash for processors that don't write artifacts (e.g. DropColumnsProcessor)
Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage
Align get_processor_file_paths() with list_processor_names() so both methods discover directories and single parquet files consistently

Changes

Fixed

PreviewResults.processor_artifacts type annotation (dict[str, list[dict]])
Preview processor discovery no longer crashes on processors without artifacts
get_processor_file_paths() now discovers single-file processors, not just directories

Changed

io_helpers.py - New list_processor_names() and load_processor_dataset() standalone functions
artifact_storage.py - Wrapper methods that convert FileNotFoundError to ArtifactStorageError
data_designer.py - Preview uses list_processor_names() instead of get_processor_configs()
results.py - Delegates to ArtifactStorage.load_processor_dataset()

Test plan

Unit tests for list_processor_names (empty, directory, single file, mixed, dedup)
Parametrized test for load_processor_dataset (directory and single file layouts)
Error type tests (standalone raises FileNotFoundError, ArtifactStorage wraps as ArtifactStorageError)
Test for get_processor_file_paths with single-file processors
All engine tests pass
Pre-commit hooks pass

Description updated with AI

- Fix PreviewResults.processor_artifacts type from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual data produced by df.to_dict(orient="records") - Add ArtifactStorage.load_processor_dataset() to centralize loading logic, handling both directory-based (batched) and single-file (preview) layouts - Add ArtifactStorage.list_processor_names() to discover processor outputs from disk rather than iterating config, fixing a bug where processors that don't write artifacts (e.g. DropColumnsProcessor) would crash the library's preview - Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage

greptile-apps · 2026-03-03T16:32:44Z

Greptile Summary

This PR successfully fixes several related bugs in processor artifact discovery, type annotations, and loading across the data-designer stack, and introduces two new standalone helper functions (list_processor_names, load_processor_dataset) in io_helpers.py that provide a consistent, layout-agnostic API for both batched (directory) and preview (single-file) parquet storage.

Key changes:

PreviewResults.processor_artifacts type corrected from dict[str, list[str] | str] to dict[str, list[dict]], accurately reflecting the output of df.to_dict(orient="records").
Preview processor discovery now uses disk contents (list_processor_names()) instead of iterating get_processor_configs(), eliminating crashes for processors like DropColumnsProcessor that don't write artifacts.
New list_processor_names() uses a dict for O(1) deduplication so a processor with both a directory and a .parquet file only appears once.
get_processor_file_paths() now handles single-file layouts (previously only handled directories), aligning it with list_processor_names().
DatasetCreationResults.load_processor_dataset() is simplified to delegate to ArtifactStorage.
Test coverage is comprehensive: all new functions, both storage layouts, error type assertions, deduplication, and single-file get_processor_file_paths.

All changes are well-tested and fix real bugs with no regressions.

Confidence Score: 5/5

Safe to merge; all changes are well-tested and fix real bugs with no known regressions.
All new code paths are covered by parametrized and error-type tests. The deduplication behaviour is verified explicitly. Changes fix concrete issues (processor discovery crash, type annotation mismatch, single-file layout support) with comprehensive test coverage across both storage layouts and error scenarios.
No files require special attention.

_{Last reviewed commit: b5b49e2}

greptile-apps

_{6 files reviewed, 3 comments}

_{Edit Code Review Agent Settings | Greptile}

…ading Extract list_processor_names() and load_processor_dataset() as module-level functions that accept a Path, so consumers can use them without constructing an ArtifactStorage. The ArtifactStorage methods now delegate to these.

Move list_processor_names() and load_processor_dataset() to data_designer.config.utils.io_helpers so they're usable by any consumer that depends on the lightweight config package, without needing the engine's ArtifactStorage. The standalone functions raise FileNotFoundError; the ArtifactStorage methods wrap this as ArtifactStorageError.

Reuse list_processor_names so both methods discover directories and single parquet files consistently.

andreatgretel · 2026-03-03T17:49:37Z

        )


+def list_processor_names(processors_outputs_path: Path) -> list[str]:


These processor helpers are standalone (not on ArtifactStorage) because the service layer uses them directly without an ArtifactStorage instance.

mikeknep

One question, otherwise LGTM, thanks!

andreatgretel requested a review from a team as a code owner March 3, 2026 16:27

greptile-apps Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py Outdated

andreatgretel added 3 commits March 3, 2026 13:35

fix: deduplicate list_processor_names when both dir and file exist

531c761

greptile-apps Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread packages/data-designer-config/src/data_designer/config/utils/io_helpers.py

chore: trim docstrings and deduplicate tests

0a740f3

greptile-apps Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread packages/data-designer/src/data_designer/interface/data_designer.py

fix: align get_processor_file_paths with list_processor_names

f74a8e5

Reuse list_processor_names so both methods discover directories and single parquet files consistently.

andreatgretel commented Mar 3, 2026

View reviewed changes

mikeknep approved these changes Mar 3, 2026

View reviewed changes

Comment thread packages/data-designer-config/src/data_designer/config/preview_results.py

Merge branch 'main' into andreatgretel/fix/processor-artifacts-cleanup

b5b49e2

andreatgretel merged commit c64b0a5 into main Mar 6, 2026
47 checks passed

andreatgretel deleted the andreatgretel/fix/processor-artifacts-cleanup branch April 14, 2026 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: processor artifacts type, discovery, and loading#366

fix: processor artifacts type, discovery, and loading#366
andreatgretel merged 7 commits into
mainfrom
andreatgretel/fix/processor-artifacts-cleanup

andreatgretel commented Mar 3, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Mar 3, 2026 •

edited

Loading

Confidence Score: 5/5

Uh oh!

greptile-apps Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel Mar 3, 2026

Uh oh!

mikeknep left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		)


		def list_processor_names(processors_outputs_path: Path) -> list[str]:

Conversation

andreatgretel commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Fixed

Changed

Test plan

Uh oh!

greptile-apps Bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Uh oh!

greptile-apps Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

mikeknep left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andreatgretel commented Mar 3, 2026 •

edited

Loading

greptile-apps Bot commented Mar 3, 2026 •

edited

Loading

greptile-apps Bot left a comment •

edited

Loading