Async engine: multi_modal_context columns missing from required_columns breaks dependency graph #520

@nabinchha

Bug

LLMTextColumnConfig.required_columns and ImageColumnConfig.required_columns only extract dependency column names from Jinja2 templates in prompt and system_prompt. They do not include columns referenced by multi_modal_context[*].column_name.

Impact

The async engine (DATA_DESIGNER_ASYNC_ENGINE=1) builds an ExecutionGraph from each column config's required_columns to determine task dependencies and dispatch order. When a seed column is referenced only via multi_modal_context (not in the Jinja2 prompt), the execution graph has no edge connecting the seed column to the LLM column that needs it. The scheduler dispatches the LLM cell task before the seed data has been loaded into the row buffer, causing ImageContext.get_contexts to fail with a KeyError on the missing column name.
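The missing edge can be illustrated with a small stand-alone sketch. This is not the project's ExecutionGraph; it uses the standard-library `graphlib.TopologicalSorter` as a stand-in for a dependency-driven scheduler, and the column names `image` and `caption` and the helper `first_ready_wave` are hypothetical.

```python
from graphlib import TopologicalSorter


def first_ready_wave(required_columns: dict[str, list[str]]) -> set[str]:
    """Return the columns a dependency-driven scheduler could dispatch immediately."""
    sorter = TopologicalSorter(required_columns)  # maps column -> its prerequisites
    sorter.prepare()
    return set(sorter.get_ready())


# Hypothetical columns: "image" is a seed column, "caption" is an LLM column
# that reads "image" only through multi_modal_context.
buggy = {"image": [], "caption": []}         # required_columns omits "image"
fixed = {"image": [], "caption": ["image"]}  # required_columns includes "image"

buggy_wave = first_ready_wave(buggy)
fixed_wave = first_ready_wave(fixed)
print(buggy_wave, fixed_wave)
```

With the edge missing, both columns are ready in the first wave, so nothing prevents the LLM task from running before the seed data exists. With the edge present, only the seed column is ready at first.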

The sync engine is unaffected because it processes all from_scratch generators first, populating the entire batch buffer before any cell-by-cell generators run.

Reproduction

Any recipe that uses LLMStructuredColumnConfig (or any LLMTextColumnConfig subclass) with multi_modal_context referencing a seed column will fail under DATA_DESIGNER_ASYNC_ENGINE=1:

Non-retryable failure on <column>[rg=0, row=None]: '<multi_modal_column_name>'

After this failure, all records are dropped and a DataDesignerGenerationError is raised.

Fix

Include multi_modal_context column names in required_columns for both LLMTextColumnConfig and ImageColumnConfig:

if self.multi_modal_context:
    required_cols.extend(ctx.column_name for ctx in self.multi_modal_context)
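For context, here is a minimal sketch of how the proposed lines might sit inside a `required_columns` property. The classes `SketchImageContext` and `SketchLLMColumnConfig` are hypothetical stand-ins, not the real config classes, and the regular expression is a deliberately simplified substitute for Jinja2 variable extraction (it only recognises plain `{{ name }}` references). The order-preserving de-duplication at the end is also an assumption, not something the issue specifies.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for Jinja2 variable extraction: matches "{{ name" only.
_TEMPLATE_VAR = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


@dataclass
class SketchImageContext:
    """Hypothetical stand-in for a multi_modal_context entry."""
    column_name: str


@dataclass
class SketchLLMColumnConfig:
    """Hypothetical stand-in for an LLM column config."""
    name: str
    prompt: str
    system_prompt: Optional[str] = None
    multi_modal_context: Optional[list[SketchImageContext]] = None

    @property
    def required_columns(self) -> list[str]:
        # Columns referenced from the templates.
        required_cols = _TEMPLATE_VAR.findall(self.prompt)
        if self.system_prompt:
            required_cols.extend(_TEMPLATE_VAR.findall(self.system_prompt))
        # The proposed fix: also depend on columns used as multi-modal context.
        if self.multi_modal_context:
            required_cols.extend(ctx.column_name for ctx in self.multi_modal_context)
        # Assumption: drop duplicates while keeping first-seen order.
        return list(dict.fromkeys(required_cols))


with_context = SketchLLMColumnConfig(
    name="caption",
    prompt="Describe the picture for {{ audience }}.",
    multi_modal_context=[SketchImageContext("image")],
)
without_context = SketchLLMColumnConfig(
    name="caption",
    prompt="Describe the picture for {{ audience }}.",
)
print(with_context.required_columns, without_context.required_columns)
```

In this sketch the `image` column appears in `required_columns` even though no template mentions it, which is the edge the async scheduler needs.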

Affected files

  • packages/data-designer-config/src/data_designer/config/column_configs.py
