Misleading DatasetProfilerConfigurationError when all records fail during generation #382

@nabinchha

Problem

When all records in a batch fail during generation, the profiler raises a misleading DatasetProfilerConfigurationError: Column '<name>' not found in dataset. This error points to a configuration problem (misconfigured column names), when the actual issue is that the dataset is empty because every record was dropped due to generation failures.

This affects any column type where generation can fail (LLM text, LLM code, LLM structured, image, etc.), not just image columns.

Reproduction

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

config_builder = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(
        alias="my-model",
        model="some-model/that-is-down",
        provider="some-provider",
    ),
])

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="topic",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["science", "history", "art"]),
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="explanation",
        prompt="Write a short explanation about {{topic}}.",
        model_alias="my-model",
    )
)

# If the LLM fails for every record, the dataset is empty and the profiler
# raises: DatasetProfilerConfigurationError: Column 'topic' not found in dataset
preview_results = data_designer.preview(config_builder=config_builder, num_records=1)

Observed behavior

The error traceback shows:

DatasetProfilerConfigurationError: Column 'topic' not found in dataset

During handling of the above exception, another exception occurred:

DataDesignerProfilingError: 🛑 Error profiling preview dataset: Column 'topic' not found in dataset

This is confusing because:

  • The column topic is correctly configured as a SamplerColumnConfig
  • The real problem is that the LLM column failed for all records, resulting in an empty dataset
  • The user has to carefully read the earlier warning log ("Generation for record at index 0 failed. Will omit this record from the dataset.") to understand what actually went wrong
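
The mechanism can be sketched without the library. The snippet below is a minimal stand-in, not the actual profiler code. It assumes that a dataset whose records were all dropped ends up with no columns, and that schema validation checks each configured column name against the dataset's columns. `SchemaError`, `dataset_columns`, and `validate_schema` are hypothetical names used only for illustration.

```python
class SchemaError(Exception):
    """Hypothetical stand-in for DatasetProfilerConfigurationError."""


def dataset_columns(records: list) -> list:
    # Columns are inferred from the surviving records, so zero records
    # means zero columns (assumed to mirror the empty-dataset case).
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def validate_schema(configured: list, records: list) -> None:
    # Report the first configured column that the dataset does not contain.
    columns = dataset_columns(records)
    for name in configured:
        if name not in columns:
            raise SchemaError(f"Column '{name}' not found in dataset")


configured_columns = ["topic", "explanation"]

# A healthy dataset passes validation.
validate_schema(configured_columns, [{"topic": "art", "explanation": "..."}])

# Every record dropped: the first configured column is reported as missing,
# even though the configuration itself is correct.
try:
    validate_schema(configured_columns, [])
except SchemaError as exc:
    message = str(exc)
    print(message)
```

Under these assumptions the reported column is simply the first one checked, which would explain why the sampler column topic is named rather than the LLM column that actually failed.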

Expected behavior

The error should clearly indicate that the dataset is empty due to generation failures, not suggest a schema mismatch. Ideally, the user should see something like:

"Dataset is empty — all N records were dropped due to generation failures. Check the warnings above for details."

Proposed solution

Handle the empty dataset case early, before schema validation. Two changes:

1. Early return in DataDesignerDatasetProfiler.profile_dataset() for empty datasets

In packages/data-designer-engine/src/data_designer/engine/analysis/dataset_profiler.py, add an early check at the top of profile_dataset():

def profile_dataset(self, target_num_records: int, dataset: pd.DataFrame) -> DatasetProfilerResults:
    logger.info("📐 Measuring dataset column statistics:")

    if len(dataset) == 0:
        logger.warning(
            "⚠️ Dataset is empty — all records were dropped during generation. "
            "Skipping profiling. Check the warnings above for details on why records failed."
        )
        return DatasetProfilerResults(
            num_records=0,
            target_num_records=target_num_records,
            side_effect_column_names=[],
            column_statistics=[],
            column_profiles=None,
        )

    self._validate_schema_consistency(list(dataset.columns))
    # ... rest of method
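
The ordering of the guard can be checked in isolation. The following is a simplified sketch using only the standard library: `ProfilerResults` and `profile` are hypothetical, reduced stand-ins for DatasetProfilerResults and profile_dataset(), and the dataset is modeled as a list of dicts rather than a DataFrame.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("profiler_sketch")


@dataclass
class ProfilerResults:
    """Hypothetical, reduced stand-in for DatasetProfilerResults."""
    num_records: int
    target_num_records: int
    column_statistics: list = field(default_factory=list)


def profile(records: list, configured: list, target_num_records: int) -> ProfilerResults:
    # Guard first: an empty dataset is a generation problem, not a schema problem.
    if len(records) == 0:
        logger.warning("Dataset is empty; skipping profiling.")
        return ProfilerResults(num_records=0, target_num_records=target_num_records)

    # Schema validation only runs when there is data to validate.
    present = set().union(*(record.keys() for record in records))
    missing = [name for name in configured if name not in present]
    if missing:
        raise ValueError(f"Column '{missing[0]}' not found in dataset")

    # Placeholder statistics: one entry per configured column.
    stats = [{"column": name, "count": len(records)} for name in configured]
    return ProfilerResults(len(records), target_num_records, stats)


# Empty input returns an empty result instead of raising.
empty_result = profile([], ["topic", "explanation"], target_num_records=1)

# Non-empty input is validated and profiled as before.
full_result = profile([{"topic": "art", "explanation": "x"}], ["topic", "explanation"], 1)
```

A genuinely misconfigured column name still raises on a non-empty dataset, so the guard does not hide real schema errors.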

2. Surface a clear warning in DataDesigner.preview() / DataDesigner.create()

In packages/data-designer/src/data_designer/interface/data_designer.py, after generation and before profiling, log a warning when the dataset is empty:

if len(processed_dataset) == 0:
    logger.warning(
        "⚠️ No records were successfully generated. "
        "All records were dropped due to generation failures."
    )
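
To match the "all N records" wording from the expected behavior, the warning could also include the requested record count. This is a sketch only: `empty_dataset_message` and `warn_if_empty` are hypothetical helpers, not existing functions in data_designer, and how the counts are obtained inside preview() / create() is left open.

```python
import logging
from typing import Optional

logger = logging.getLogger("data_designer_sketch")


def empty_dataset_message(num_generated: int, num_requested: int) -> Optional[str]:
    """Return a warning message when nothing was generated, else None."""
    if num_generated > 0:
        return None
    return (
        f"Dataset is empty: all {num_requested} records were dropped due to "
        "generation failures. Check the warnings above for details."
    )


def warn_if_empty(num_generated: int, num_requested: int) -> bool:
    """Log the warning if the dataset is empty; report whether it was logged."""
    message = empty_dataset_message(num_generated, num_requested)
    if message is None:
        return False
    logger.warning(message)
    return True
```

Returning a boolean lets the caller decide whether to skip profiling altogether once the warning has been logged.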

Affected code paths

  • DataDesigner.preview() (data_designer.py:276-280)
  • DataDesigner.create() (data_designer.py:215-222)
  • DataDesignerDatasetProfiler.profile_dataset() (dataset_profiler.py:64-103)
  • DataDesignerDatasetProfiler._validate_schema_consistency() (dataset_profiler.py:145-148)

Labels: bug
