Problem
When all records in a batch fail during generation, the profiler raises a misleading DatasetProfilerConfigurationError: Column '<name>' not found in dataset. This error suggests a configuration problem — that the user misconfigured their column names — when the actual issue is that the dataset is empty because every record was dropped due to generation failures.
This affects any column type where generation can fail (LLM text, LLM code, LLM structured, image, etc.), not just image columns.
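A minimal sketch of how an empty dataset can surface as a missing-column error, assuming the dataset is a pandas DataFrame built from the records that survive generation (the engine's actual construction path is not shown in this issue). The names `surviving_records` and `expected_columns` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in: every record was dropped, so nothing survives.
surviving_records: list[dict[str, str]] = []

# With zero records, pandas has nothing to infer column names from.
dataset = pd.DataFrame(surviving_records)

# Columns the configuration declares (assumed names from the reproduction).
expected_columns = ["topic", "explanation"]

# Any membership check against the empty frame fails, including for
# columns that are configured correctly.
missing = [name for name in expected_columns if name not in dataset.columns]

print(f"rows={len(dataset)}, columns={list(dataset.columns)}, missing={missing}")
```

Under that assumption, even the sampler column `topic` appears to be missing, which would explain why the error names a column that was never misconfigured.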
Reproduction
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

config_builder = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(
        alias="my-model",
        model="some-model/that-is-down",
        provider="some-provider",
    ),
])

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="topic",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["science", "history", "art"]),
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="explanation",
        prompt="Write a short explanation about {{topic}}.",
        model_alias="my-model",
    )
)

# If the LLM fails for every record, the dataset is empty and the profiler
# raises: DatasetProfilerConfigurationError: Column 'topic' not found in dataset
preview_results = data_designer.preview(config_builder=config_builder, num_records=1)
Observed behavior
The error traceback shows:
DatasetProfilerConfigurationError: Column 'topic' not found in dataset
During handling of the above exception, another exception occurred:
DataDesignerProfilingError: 🛑 Error profiling preview dataset: Column 'topic' not found in dataset
This is confusing because:
- The column topic is correctly configured as a SamplerColumnConfig
- The real problem is that the LLM column failed for all records, resulting in an empty dataset
- The user has to carefully read the earlier warning log ("Generation for record at index 0 failed. Will omit this record from the dataset.") to understand what actually went wrong
Expected behavior
The error should clearly indicate that the dataset is empty due to generation failures, not suggest a schema mismatch. Ideally, the user should see something like:
"Dataset is empty — all N records were dropped due to generation failures. Check the warnings above for details."
Proposed solution
Handle the empty dataset case early, before schema validation. Two changes:
1. Early return in DataDesignerDatasetProfiler.profile_dataset() for empty datasets
In packages/data-designer-engine/src/data_designer/engine/analysis/dataset_profiler.py, add an early check at the top of profile_dataset():
def profile_dataset(self, target_num_records: int, dataset: pd.DataFrame) -> DatasetProfilerResults:
    logger.info("📐 Measuring dataset column statistics:")

    if len(dataset) == 0:
        logger.warning(
            "⚠️ Dataset is empty — all records were dropped during generation. "
            "Skipping profiling. Check the warnings above for details on why records failed."
        )
        return DatasetProfilerResults(
            num_records=0,
            target_num_records=target_num_records,
            side_effect_column_names=[],
            column_statistics=[],
            column_profiles=None,
        )

    self._validate_schema_consistency(list(dataset.columns))
    # ... rest of method
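The ordering above (empty check first, schema validation second) can be exercised in isolation. The following is a rough, self-contained sketch rather than the engine's code: `SchemaError`, `validate_schema`, and `profile_sketch` are hypothetical stand-ins for `DatasetProfilerConfigurationError`, `_validate_schema_consistency()`, and `profile_dataset()`, and the returned dict only approximates `DatasetProfilerResults`.

```python
import logging

import pandas as pd

logger = logging.getLogger("profiler_sketch")


class SchemaError(Exception):
    """Hypothetical stand-in for DatasetProfilerConfigurationError."""


def validate_schema(expected_columns: list[str], actual_columns: list[str]) -> None:
    # Raise if any configured column is absent from the dataset.
    for name in expected_columns:
        if name not in actual_columns:
            raise SchemaError(f"Column {name!r} not found in dataset")


def profile_sketch(expected_columns: list[str], dataset: pd.DataFrame) -> dict:
    # Guard first: an empty frame never reaches schema validation,
    # so it cannot trigger a misleading missing-column error.
    if len(dataset) == 0:
        logger.warning("Dataset is empty; skipping profiling.")
        return {"num_records": 0, "column_statistics": []}

    # Non-empty datasets are still validated as before.
    validate_schema(expected_columns, list(dataset.columns))
    return {"num_records": len(dataset), "column_statistics": list(dataset.columns)}
```

Calling `profile_sketch(["topic"], pd.DataFrame([]))` returns the empty result instead of raising, while a non-empty frame that genuinely lacks a configured column still raises `SchemaError`, so real misconfigurations remain visible.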
2. Surface a clear warning in DataDesigner.preview() / DataDesigner.create()
In packages/data-designer/src/data_designer/interface/data_designer.py, after generation and before profiling, log a warning when the dataset is empty:
if len(processed_dataset) == 0:
    logger.warning(
        "⚠️ No records were successfully generated. "
        "All records were dropped due to generation failures."
    )
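Since the same check is needed in both preview() and create(), one option is a small shared helper that also reports the record count, in line with the "all N records" wording under Expected behavior. This is only a sketch: `warn_if_empty` and `num_requested` are hypothetical names, not part of the existing interface.

```python
import logging
from collections.abc import Sized

logger = logging.getLogger("data_designer_sketch")


def warn_if_empty(dataset: Sized, num_requested: int) -> bool:
    """Log a warning and return True when no records survived generation."""
    # Anything with a length works here, including a pandas DataFrame.
    if len(dataset) > 0:
        return False

    logger.warning(
        "No records were successfully generated. "
        "All %d requested records were dropped due to generation failures. "
        "Check the warnings above for details.",
        num_requested,
    )
    return True
```

Returning a boolean would let each caller decide whether to skip profiling or continue, without duplicating the message in two places.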
Affected code paths
DataDesigner.preview() (data_designer.py:276-280)
DataDesigner.create() (data_designer.py:215-222)
DataDesignerDatasetProfiler.profile_dataset() (dataset_profiler.py:64-103)
DataDesignerDatasetProfiler._validate_schema_consistency() (dataset_profiler.py:145-148)