Skip to content

fix: processor artifacts type, discovery, and loading#366

Merged
andreatgretel merged 7 commits into
mainfrom
andreatgretel/fix/processor-artifacts-cleanup
Mar 6, 2026
Merged

fix: processor artifacts type, discovery, and loading#366
andreatgretel merged 7 commits into
mainfrom
andreatgretel/fix/processor-artifacts-cleanup

Conversation

@andreatgretel

@andreatgretel andreatgretel commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fix PreviewResults.processor_artifacts type annotation from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual output of df.to_dict(orient="records")
  • Add list_processor_names() and load_processor_dataset() as standalone functions in data_designer.config.utils.io_helpers - they take a Path and handle both directory-based (batched) and single-file (preview) storage layouts
  • Fix preview bug: discovery now uses disk contents instead of iterating get_processor_configs(), which would crash for processors that don't write artifacts (e.g. DropColumnsProcessor)
  • Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage
  • Align get_processor_file_paths() with list_processor_names() so both methods discover directories and single parquet files consistently

Changes

Fixed

  • PreviewResults.processor_artifacts type annotation (dict[str, list[dict]])
  • Preview processor discovery no longer crashes on processors without artifacts
  • get_processor_file_paths() now discovers single-file processors, not just directories

Changed

  • io_helpers.py - New list_processor_names() and load_processor_dataset() standalone functions
  • artifact_storage.py - Wrapper methods that convert FileNotFoundError to ArtifactStorageError
  • data_designer.py - Preview uses list_processor_names() instead of get_processor_configs()
  • results.py - Delegates to ArtifactStorage.load_processor_dataset()

Test plan

  • Unit tests for list_processor_names (empty, directory, single file, mixed, dedup)
  • Parametrized test for load_processor_dataset (directory and single file layouts)
  • Error type tests (standalone raises FileNotFoundError, ArtifactStorage wraps as ArtifactStorageError)
  • Test for get_processor_file_paths with single-file processors
  • All engine tests pass
  • Pre-commit hooks pass

Description updated with AI

- Fix PreviewResults.processor_artifacts type from dict[str, list[str] | str]
  to dict[str, list[dict]], matching the actual data produced by
  df.to_dict(orient="records")
- Add ArtifactStorage.load_processor_dataset() to centralize loading logic,
  handling both directory-based (batched) and single-file (preview) layouts
- Add ArtifactStorage.list_processor_names() to discover processor outputs
  from disk rather than iterating config, fixing a bug where processors
  that don't write artifacts (e.g. DropColumnsProcessor) would crash the
  library's preview
- Simplify DatasetCreationResults.load_processor_dataset() to delegate to
  ArtifactStorage
@andreatgretel andreatgretel requested a review from a team as a code owner March 3, 2026 16:27
@greptile-apps

greptile-apps Bot commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR successfully fixes several related bugs in processor artifact discovery, type annotations, and loading across the data-designer stack, and introduces two new standalone helper functions (list_processor_names, load_processor_dataset) in io_helpers.py that provide a consistent, layout-agnostic API for both batched (directory) and preview (single-file) parquet storage.

Key changes:

  • PreviewResults.processor_artifacts type corrected from dict[str, list[str] | str] to dict[str, list[dict]], accurately reflecting the output of df.to_dict(orient="records").
  • Preview processor discovery now uses disk contents (list_processor_names()) instead of iterating get_processor_configs(), eliminating crashes for processors like DropColumnsProcessor that don't write artifacts.
  • New list_processor_names() uses a dict for O(1) deduplication so a processor with both a directory and a .parquet file only appears once.
  • get_processor_file_paths() now handles single-file layouts (previously only handled directories), aligning it with list_processor_names().
  • DatasetCreationResults.load_processor_dataset() is simplified to delegate to ArtifactStorage.
  • Test coverage is comprehensive: all new functions, both storage layouts, error type assertions, deduplication, and single-file get_processor_file_paths.

All changes are well-tested and fix real bugs with no regressions.

Confidence Score: 5/5

  • Safe to merge; all changes are well-tested and fix real bugs with no known regressions.
  • All new code paths are covered by parametrized and error-type tests. The deduplication behaviour is verified explicitly. Changes fix concrete issues (processor discovery crash, type annotation mismatch, single-file layout support) with comprehensive test coverage across both storage layouts and error scenarios.
  • No files require special attention.

Last reviewed commit: b5b49e2

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

…ading

Extract list_processor_names() and load_processor_dataset() as module-level
functions that accept a Path, so consumers can use them without constructing
an ArtifactStorage. The ArtifactStorage methods now delegate to these.
Move list_processor_names() and load_processor_dataset() to
data_designer.config.utils.io_helpers so they're usable by any consumer
that depends on the lightweight config package, without needing the
engine's ArtifactStorage. The standalone functions raise FileNotFoundError;
the ArtifactStorage methods wrap this as ArtifactStorageError.
Comment thread packages/data-designer/src/data_designer/interface/data_designer.py
Reuse list_processor_names so both methods discover directories and
single parquet files consistently.
)


def list_processor_names(processors_outputs_path: Path) -> list[str]:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These processor helpers are standalone (not on ArtifactStorage) because the service layer uses them directly without an ArtifactStorage instance.

@mikeknep mikeknep left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, otherwise LGTM, thanks!

@andreatgretel andreatgretel merged commit c64b0a5 into main Mar 6, 2026
47 checks passed
@andreatgretel andreatgretel deleted the andreatgretel/fix/processor-artifacts-cleanup branch April 14, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants