fix: processor artifacts type, discovery, and loading#366
Merged
andreatgretel merged 7 commits intoMar 6, 2026
Conversation
- Fix PreviewResults.processor_artifacts type from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual data produced by df.to_dict(orient="records") - Add ArtifactStorage.load_processor_dataset() to centralize loading logic, handling both directory-based (batched) and single-file (preview) layouts - Add ArtifactStorage.list_processor_names() to discover processor outputs from disk rather than iterating config, fixing a bug where processors that don't write artifacts (e.g. DropColumnsProcessor) would crash the library's preview - Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage
Contributor
Greptile SummaryThis PR successfully fixes several related bugs in processor artifact discovery, type annotations, and loading across the data-designer stack, and introduces two new standalone helper functions ( Key changes:
All changes are well-tested and fix real bugs with no regressions.
|
…ading Extract list_processor_names() and load_processor_dataset() as module-level functions that accept a Path, so consumers can use them without constructing an ArtifactStorage. The ArtifactStorage methods now delegate to these.
Move list_processor_names() and load_processor_dataset() to data_designer.config.utils.io_helpers so they're usable by any consumer that depends on the lightweight config package, without needing the engine's ArtifactStorage. The standalone functions raise FileNotFoundError; the ArtifactStorage methods wrap this as ArtifactStorageError.
Reuse list_processor_names so both methods discover directories and single parquet files consistently.
andreatgretel
commented
Mar 3, 2026
| ) | ||
|
|
||
|
|
||
| def list_processor_names(processors_outputs_path: Path) -> list[str]: |
Contributor
Author
There was a problem hiding this comment.
These processor helpers are standalone (not on ArtifactStorage) because the service layer uses them directly without an ArtifactStorage instance.
mikeknep
approved these changes
Mar 3, 2026
mikeknep
left a comment
Contributor
There was a problem hiding this comment.
One question, otherwise LGTM, thanks!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PreviewResults.processor_artifactstype annotation fromdict[str, list[str] | str]todict[str, list[dict]], matching the actual output ofdf.to_dict(orient="records")list_processor_names()andload_processor_dataset()as standalone functions indata_designer.config.utils.io_helpers- they take aPathand handle both directory-based (batched) and single-file (preview) storage layoutsget_processor_configs(), which would crash for processors that don't write artifacts (e.g.DropColumnsProcessor)DatasetCreationResults.load_processor_dataset()to delegate toArtifactStorageget_processor_file_paths()withlist_processor_names()so both methods discover directories and single parquet files consistentlyChanges
Fixed
PreviewResults.processor_artifactstype annotation (dict[str, list[dict]])get_processor_file_paths()now discovers single-file processors, not just directoriesChanged
io_helpers.py- Newlist_processor_names()andload_processor_dataset()standalone functionsartifact_storage.py- Wrapper methods that convertFileNotFoundErrortoArtifactStorageErrordata_designer.py- Preview useslist_processor_names()instead ofget_processor_configs()results.py- Delegates toArtifactStorage.load_processor_dataset()Test plan
list_processor_names(empty, directory, single file, mixed, dedup)load_processor_dataset(directory and single file layouts)FileNotFoundError,ArtifactStoragewraps asArtifactStorageError)get_processor_file_pathswith single-file processorsDescription updated with AI